
When you need to memmove your whole vector, you're going to thrash a lot more cache lines than when you only touch two neighboring cache lines.



The basic speedup will not even be because of vectorisation. It will be because of caches.

Emphasis on 'largely'. If you have e.g. multiply indirect pointers, then you care. That said, I didn't mean to imply that moving to bigger, slower caches was the wrong tradeoff.

I only glanced at what you did, but if this is operating on matrices or vectors, there are often crazy optimizations that can be made for the CPU cache.

If one process is using only 4 bytes out of every 64 byte cache line fill, the rest of the line is wasted regardless of whether there are hundreds of other processes loading different lines into the various levels of cache. It’s not using more of the cache, just using the same lines (or fewer) more effectively.
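
A contrived sketch of that situation, assuming a 64-byte line and a hypothetical Record type where the hot loop only reads one 4-byte field:

    // Hypothetical: each Record spans a full 64-byte cache line, but the
    // loop only reads the 4-byte `count` field, so ~94% of every line
    // fill is wasted.
    #include <cstddef>
    #include <cstdint>

    struct Record {
        uint32_t count;     // the only field the hot loop touches
        char     other[60]; // rest of the record; pads out the 64-byte line
    };

    uint64_t sum_counts(const Record* records, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; ++i)
            total += records[i].count;  // one full line fill per element
        return total;
    }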

That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)

He is talking about L1 and L2 caches. They usually take up more die area than the CPU cores themselves, despite being only a few MB in size.

One thing that I've always wondered is why most cache line mappings seem to collide at powers of two -- since so much software uses power-of-two-sized buffers, it seems like it would be a big win to bucket by some non-power-of-two size.
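
Rough sketch of why that happens (the numbers below are assumptions about a "typical" core: 32 KiB, 8-way L1D with 64-byte lines, i.e. 64 sets), so addresses that differ by a multiple of 64 * 64 = 4096 bytes all land in the same set:

    // Sketch: set index for an assumed 32 KiB, 8-way, 64-byte-line L1D.
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kLineSize = 64;
    constexpr uint64_t kNumSets  = 64;  // 32 KiB / (64-byte lines * 8 ways)

    uint64_t set_index(uint64_t addr) {
        return (addr / kLineSize) % kNumSets;
    }

    int main() {
        // Buffers spaced a power-of-two 4096 bytes apart all map to set 0.
        for (uint64_t addr = 0; addr < 8 * 4096; addr += 4096)
            std::printf("addr %6lu -> set %lu\n",
                        (unsigned long)addr, (unsigned long)set_index(addr));
    }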

I'd rather have really big caches per core than a really big number of cores.

You have to use at least one cache line for memcpy, but you probably need more to pipeline the accesses and get good speed. Benchmarking memcpy is complicated, though.
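
A bare-bones sketch of the kind of timing loop people start from (the buffer size and rep count are arbitrary; a real benchmark also controls alignment, cache warmth, frequency scaling, and whether the compiler elides the copy):

    // Minimal memcpy throughput sketch; not a rigorous benchmark.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        constexpr size_t kBytes = 64u * 1024 * 1024;
        constexpr int    kReps  = 10;
        std::vector<char> src(kBytes, 1), dst(kBytes, 0);

        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < kReps; ++rep)
            std::memcpy(dst.data(), src.data(), kBytes);
        auto t1 = std::chrono::steady_clock::now();

        volatile char sink = dst[kBytes - 1];  // keep the copies observable
        (void)sink;

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.2f GB/s copied\n", kReps * double(kBytes) / secs / 1e9);
    }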

Different cache line sizes for different cores seem like an absurdly bad idea: for one, it opens you up to bugs like these, but it also makes optimization a lot harder. I have a hard time believing the savings from a larger line size are worth it.

> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

Wouldn't having the ys, zs and ws occupy different cache lines be good enough? After all, the CPU only wants fast access to this data. Or maybe it's hard for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?
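
For reference, a sketch of the two layouts in question (the names are made up):

    // Layout from the quote: xs contiguous, (y, z, w) interleaved.
    struct HybridLayout {
        float* xs;    // [x1, x2, x3, ...]
        float* yzws;  // [y1, z1, w1, y2, z2, w2, ...]
    };

    // Fully split alternative: every component in its own stream. Modern
    // cores have several line-fill buffers and hardware prefetchers, so
    // keeping 4-5 sequential streams alive is usually not a problem.
    struct SoALayout {
        float* xs;
        float* ys;
        float* zs;
        float* ws;
    };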


Cache lines are, like, a thing, man. Unless you're on an embedded processor.

Then it turns from a cache miss problem to a memory bandwidth problem. If you are only using a small proportion of the cache line in your calculation, then you effectively multiply the data transferred to the CPU.
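
For a rough number: touching one 4-byte float per 64-byte line means the memory system moves 64 / 4 = 16x more data than the computation actually consumes.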

It's easy to buy memory, but hard to buy L2/L3 cache. The whole point of the exercise is to scale more easily on multicore architectures, but it's no good if you blow out the cache thousands of times per second and bottleneck the system on memory accesses.

Actually it is. Every cache hit increases the bandwidth available to the CPU. So generally, the larger the caches, the less memory bandwidth you need and the less sensitive you are to memory latency.

Seems like that would hurt memory locality, with a commensurate increase in cache-misses.

Thing is, you still need to zero out any cache line that might be caching that memory, which would conflict with the CPU accessing the cache. Might as well just let the CPU do the zeroing.

There are tricks to pull chunks of memory into cache as is, no? Not that they are ideal.
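
One such trick is a software prefetch hint; a sketch using the GCC/Clang builtin (it helps in some pointer-chasing loops and hurts in others):

    // Sketch: prefetch one node ahead while walking a linked list.
    // __builtin_prefetch is a GCC/Clang extension; x86 also has _mm_prefetch.
    struct Node {
        Node* next;
        long  payload;
    };

    long sum_list(const Node* n) {
        long total = 0;
        while (n) {
            if (n->next)
                __builtin_prefetch(n->next->next);  // hint only, never faults
            total += n->payload;
            n = n->next;
        }
        return total;
    }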

Yup. When performance matters, don't chase pointers (references) and don't ever do random memory accesses if you can avoid it. Ensure variables needed at about the same time are near each other, so that a single cache line fill gets all of them. The exception is when other threads/cores frequently update a variable: keep those separate to reduce false sharing.
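
A sketch of that last point, padding per-thread counters to the line size (std::hardware_destructive_interference_size is C++17; on toolchains that don't provide it, plain alignas(64) is the usual stand-in):

    // Sketch: one counter per worker, each on its own cache line, so
    // updates from different cores don't invalidate each other's lines.
    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size

    struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter per_thread_counters[8];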

Remember you typically have just 512 L1 data cache lines to work with -- up to 1024 on some CPUs, while others might have only 256. Wrong data access patterns will continuously invalidate those.

Imagine processing an image stored in row-major format and reading pixels vertically. If the image is taller than 512 pixels and wider than 64 bytes, you're pretty much guaranteed continuous L1D cache misses. So swizzle [1], or at least divide the image into smaller blocks first, to improve the L1D cache hit rate.
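
As a sketch of the blocked approach, here's a transpose of a row-major 8-bit image done in 64x64 tiles (a hypothetical routine; the tile size would be tuned to the target's L1D):

    // Sketch: blocked transpose of a row-major 8-bit image. Naive
    // column-order access misses L1D on almost every pixel once the image
    // is tall enough; working in 64x64 tiles keeps the source and
    // destination lines resident while they're being reused.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    void transpose_blocked(const uint8_t* src, uint8_t* dst,
                           size_t width, size_t height) {
        constexpr size_t kTile = 64;
        for (size_t y0 = 0; y0 < height; y0 += kTile)
            for (size_t x0 = 0; x0 < width; x0 += kTile)
                for (size_t y = y0; y < std::min(y0 + kTile, height); ++y)
                    for (size_t x = x0; x < std::min(x0 + kTile, width); ++x)
                        dst[x * height + y] = src[y * width + x];
    }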

On top of that, using SIMD is pretty beneficial.

Small nitpick: pretty much all of the CPU cores I can think of (representing 99% of cases, I think) have 64-byte cache lines, not 16 as this article suggests. The rest have 32-byte cache lines. While they might exist, I don't know of any modern CPU with 16-byte cache lines.
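
If you don't want to hard-code 64, the line size can also be queried at runtime; a sketch for Linux/glibc (the sysconf name is platform-specific):

    // Sketch: query the L1 data cache line size (Linux/glibc).
    #include <cstdio>
    #include <unistd.h>

    int main() {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line <= 0)
            line = 64;  // common default on current x86 and ARM cores
        std::printf("L1D line size: %ld bytes\n", line);
    }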

[1]: One common swizzling curve/algorithm: https://en.wikipedia.org/wiki/Z-order_curve
