Emphasis on 'largely'. If you have, e.g., multiple levels of pointer indirection, then you do care. That said, I didn't mean to imply that moving to bigger, slower caches was the wrong tradeoff.
If one process uses only 4 bytes out of every 64-byte cache line fill, the rest of the line is wasted regardless of whether there are hundreds of other processes loading different lines into the various levels of cache. The win isn't from using more of the cache; it's from using the same lines (or fewer) more effectively.
That sounds suspiciously like an implementation tailored to that cache line size might see a considerable part of the speedup even with non-SIMD operations (or with narrower SIMD)?
One thing I've always wondered is why cache line mappings seem to collide at powers of two -- since so much software uses power-of-two-sized buffers, it seems like it would be a big win to bucket by some non-power-of-two size.
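A rough sketch of why that happens, assuming an illustrative 8-way, 32 KiB L1D with 64-byte lines (so 64 sets); the addresses and strides are made up:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative cache geometry: 64-byte lines, 64 sets (e.g. 32 KiB, 8-way). */
    #define LINE_SIZE 64
    #define NUM_SETS  64

    /* The set index is just a few middle bits of the address. */
    static unsigned set_index(uintptr_t addr) {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int main(void) {
        uintptr_t base = 0x100000;
        /* Power-of-two stride (4096): every element of a "column" maps to the
         * same set, so an 8-way cache starts thrashing after 8 accesses. */
        for (int i = 0; i < 4; i++)
            printf("stride 4096, elem %d -> set %u\n",
                   i, set_index(base + (uintptr_t)i * 4096));
        /* Pad the stride by one line (4096 + 64) and the same accesses spread
         * across different sets. */
        for (int i = 0; i < 4; i++)
            printf("stride 4160, elem %d -> set %u\n",
                   i, set_index(base + (uintptr_t)i * 4160));
        return 0;
    }

The hardware indexes sets with a power-of-two modulo because it's just bit selection; padding buffers by one extra line is the usual software-side workaround.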
You have to touch at least one cache line for memcpy, but you probably need more to pipeline the accesses and get good speed. Benchmarking memcpy, though, is complicated.
Different cache line sizes for different cores seems like an absurdly bad idea: it opens you up to bugs like these, and it also makes optimization a lot harder. I have a hard time believing the savings from a larger line size are worth it.
> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).
Wouldn't having the ys, zs and ws occupy different cache lines be good enough? After all, the CPU only wants fast access to that data. Or maybe it's hard for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?
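For reference, a sketch of the two layouts being compared (field and struct names are made up): the first keeps ys, zs and ws in separate arrays, the second keeps each y,z,w triple contiguous so one line fill covers all three components for several consecutive elements.

    #include <stddef.h>

    /* Layout A: fully split arrays. Reading y[i], z[i], w[i] touches three
     * different cache lines (plus a fourth if x[i] is needed too). */
    struct split_layout {
        float *x;  /* [x1, x2, x3, ...] */
        float *y;  /* [y1, y2, y3, ...] */
        float *z;  /* [z1, z2, z3, ...] */
        float *w;  /* [w1, w2, w3, ...] */
    };

    /* Layout B: x split out, y/z/w interleaved as triples. One 64-byte line
     * holds y, z and w for roughly 5 consecutive elements. */
    struct yzw { float y, z, w; };
    struct grouped_layout {
        float      *x;    /* [x1, x2, x3, ...] */
        struct yzw *yzw;  /* [y1,z1,w1, y2,z2,w2, y3,z3,w3, ...] */
    };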
Then it turns from a cache-miss problem into a memory-bandwidth problem. If you use only a small proportion of each cache line in your calculation, you effectively multiply the amount of data transferred to the CPU.
It's easy to buy memory, but hard to buy L2/L3 cache. The whole point of the exercise is to scale more easily on multicore architectures, but it's no good if you blow out the cache thousands of times per second and bottleneck the system on memory accesses.
Actually it is. Every cache hit leaves more memory bandwidth available to the CPU. So generally, the larger the caches, the less bandwidth you need and the less sensitive you are to memory latency.
Thing is, you still need to zero out any cache line that might be caching that memory, which would conflict with the CPU's own accesses to the cache. Might as well just let the CPU do the zeroing.
Yup. When performance matters, don't chase pointers (references), and don't ever do random memory accesses if you can avoid it. Ensure variables needed at about the same time are near each other, so that a single cache line fill gets them all. The exception is variables that other threads/cores update frequently: keep those separate to reduce false sharing.
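A minimal C sketch of both rules, with hypothetical names; the 64-byte figure assumes the usual cache line size:

    #include <stdalign.h>

    /* Hot-path fields that are read together sit next to each other, so a
     * single cache line fill brings them all in. */
    struct queue_hot {
        int   head;
        int   tail;
        int   capacity;
        void *buffer;
    };

    /* Counters written frequently by different cores each get their own
     * 64-byte line, so one core's writes don't keep invalidating the line
     * that holds another core's counter (false sharing). */
    struct per_thread_counter {
        alignas(64) long value;   /* sizeof(struct per_thread_counter) == 64 */
    };

    struct per_thread_counter counters[8];  /* one slot per thread */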
Remember that you typically have just 512 L1 data cache lines to work with -- up to 1024 on some CPUs, and as few as 256 on others. The wrong data access patterns will continuously evict them.
Imagine processing an image stored in row-major format and reading pixels vertically. If the image is taller than 512 pixels and each row is wider than 64 bytes, you're pretty much guaranteed continuous L1D cache misses. So swizzle [1], or at least divide the image into smaller blocks first, to improve the L1D hit rate (see the sketch below).
On top of that, using SIMD is pretty beneficial.
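A rough sketch of the blocked-traversal idea (not the swizzle itself), with an illustrative 64x64 tile size and an 8-bit grayscale image:

    #include <stddef.h>
    #include <stdint.h>

    /* Naive column-wise walk over a row-major image: consecutive accesses are
     * 'width' bytes apart, so a tall, wide image misses L1D on almost every load. */
    uint64_t sum_columns_naive(const uint8_t *img, size_t width, size_t height) {
        uint64_t sum = 0;
        for (size_t x = 0; x < width; x++)
            for (size_t y = 0; y < height; y++)
                sum += img[y * width + x];
        return sum;
    }

    /* Blocked version: process 64x64 tiles so the lines fetched for a tile are
     * fully consumed while they are still resident in L1D. */
    uint64_t sum_columns_tiled(const uint8_t *img, size_t width, size_t height) {
        const size_t T = 64;
        uint64_t sum = 0;
        for (size_t by = 0; by < height; by += T)
            for (size_t bx = 0; bx < width; bx += T)
                for (size_t x = bx; x < bx + T && x < width; x++)
                    for (size_t y = by; y < by + T && y < height; y++)
                        sum += img[y * width + x];
        return sum;
    }

Both functions touch the same pixels; only the order changes, which is the whole trick.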
Small nitpick: pretty much all of the CPU cores I can think of (representing 99% of cases, I'd guess) have 64-byte cache lines, not 16 as this article suggests. The rest have 32-byte cache lines. While they might exist, I don't know of any modern CPU with 16-byte cache lines.