
When you need to memmove your whole vector, you're going to thrash a lot more cache lines than when you only touch two neighboring cache lines.



The basic speedup will not even be because of vectorisation. It will be because of caches.

Emphasis on 'largely'. If you have e.g. multiply indirect pointers, then you care. That said, I didn't mean to imply that moving to bigger, slower caches was the wrong tradeoff.

I only glanced at what you did, but if this is operating on matrices or vectors, there are often crazy optimizations that can be made for the CPU cache.

If one process is using only 4 bytes out of every 64 byte cache line fill, the rest of the line is wasted regardless of whether there are hundreds of other processes loading different lines into the various levels of cache. It’s not using more of the cache, just using the same lines (or fewer) more effectively.
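
A contrived sketch of that situation, assuming a 64-byte line and a hypothetical Record type where the hot loop only reads one 4-byte field:

    // Hypothetical: each Record spans a full 64-byte cache line, but the
    // loop only reads the 4-byte `count` field, so ~94% of every line
    // fill is wasted.
    #include <cstddef>
    #include <cstdint>

    struct Record {
        uint32_t count;     // the only field the hot loop touches
        char     other[60]; // rest of the record; pads out the 64-byte line
    };

    uint64_t sum_counts(const Record* records, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; ++i)
            total += records[i].count;  // one full line fill per element
        return total;
    }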

That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)

He is talking about L1 and L2 caches. They usually take up more die area than the CPU cores themselves, despite being only a few MB in size.

One thing that I've always wondered is why most cache line mappings seem to collide at powers of two -- since so much software uses power-of-two-sized buffers, it seems like it would be a big win to bucket by some non-power-of-two size.
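
Rough sketch of why that happens (the numbers below are assumptions about a "typical" core: 32 KiB, 8-way L1D with 64-byte lines, i.e. 64 sets), so addresses that differ by a multiple of 64 * 64 = 4096 bytes all land in the same set:

    // Sketch: set index for an assumed 32 KiB, 8-way, 64-byte-line L1D.
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kLineSize = 64;
    constexpr uint64_t kNumSets  = 64;  // 32 KiB / (64-byte lines * 8 ways)

    uint64_t set_index(uint64_t addr) {
        return (addr / kLineSize) % kNumSets;
    }

    int main() {
        // Buffers spaced a power-of-two 4096 bytes apart all map to set 0.
        for (uint64_t addr = 0; addr < 8 * 4096; addr += 4096)
            std::printf("addr %6lu -> set %lu\n",
                        (unsigned long)addr, (unsigned long)set_index(addr));
    }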

I'd rather have really big caches per core than a really big number of cores.

You have to use at least one cache line for memcpy, but you probably need more to pipeline the accesses and get good speed. Benchmarking memcpy is complicated, though.
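
A bare-bones sketch of the kind of timing loop people start from (the buffer size and rep count are arbitrary; a real benchmark also controls alignment, cache warmth, frequency scaling, and whether the compiler elides the copy):

    // Minimal memcpy throughput sketch; not a rigorous benchmark.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        constexpr size_t kBytes = 64u * 1024 * 1024;
        constexpr int    kReps  = 10;
        std::vector<char> src(kBytes, 1), dst(kBytes, 0);

        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < kReps; ++rep)
            std::memcpy(dst.data(), src.data(), kBytes);
        auto t1 = std::chrono::steady_clock::now();

        volatile char sink = dst[kBytes - 1];  // keep the copies observable
        (void)sink;

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.2f GB/s copied\n", kReps * double(kBytes) / secs / 1e9);
    }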

Different cache line sizes for different cores seem like an absurdly bad idea: for one, it opens you up to bugs like these, but it also makes optimization a lot harder. I have a hard time believing the savings from a larger line size are worth it.

> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).

Wouldn't having the ys, zs and ws occupy different cache lines be good enough? After all, the CPU only wants fast access to this data. Or maybe it's hard for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?
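
For reference, a sketch of the two layouts in question (the names are made up):

    // Layout from the quote: xs contiguous, (y, z, w) interleaved.
    struct HybridLayout {
        float* xs;    // [x1, x2, x3, ...]
        float* yzws;  // [y1, z1, w1, y2, z2, w2, ...]
    };

    // Fully split alternative: every component in its own stream. Modern
    // cores have several line-fill buffers and hardware prefetchers, so
    // keeping 4-5 sequential streams alive is usually not a problem.
    struct SoALayout {
        float* xs;
        float* ys;
        float* zs;
        float* ws;
    };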


Cache lines are, like, a thing, man. Unless you're on an embedded processor.

Then it turns from a cache miss problem to a memory bandwidth problem. If you are only using a small proportion of the cache line in your calculation, then you effectively multiply the data transferred to the CPU.
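
For a rough number: touching one 4-byte float per 64-byte line means the memory system moves 64 / 4 = 16x more data than the computation actually consumes.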

It's easy to buy memory, but hard to buy L2/L3 cache. The whole point of the exercise is to scale more easily on multicore architectures, but it's no good if you blow out the cache thousands of times per second and bottleneck the system on memory accesses.

Actually it is. Every cache hit increases the bandwidth available to the CPU. So generally, the larger the caches, the less memory bandwidth you need and the less sensitive you are to memory latency.

Seems like that would hurt memory locality, with a commensurate increase in cache-misses.

Thing is, you still need to zero out any cache line that might be caching that memory, which would conflict with the CPU accessing the cache. Might as well just let the CPU do the zeroing.

There are tricks to pull chunks of memory into cache as is, no? Not that they are ideal.
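
One such trick is a software prefetch hint; a sketch using the GCC/Clang builtin (it helps in some pointer-chasing loops and hurts in others):

    // Sketch: prefetch one node ahead while walking a linked list.
    // __builtin_prefetch is a GCC/Clang extension; x86 also has _mm_prefetch.
    struct Node {
        Node* next;
        long  payload;
    };

    long sum_list(const Node* n) {
        long total = 0;
        while (n) {
            if (n->next)
                __builtin_prefetch(n->next->next);  // hint only, never faults
            total += n->payload;
            n = n->next;
        }
        return total;
    }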

Yup. When performance matters, don't chase pointers (references) and don't ever do random memory accesses if you can avoid it. Ensure variables needed at about the same time are near each other, so that a single cache line fill gets all of them. The exception is when other threads/cores frequently update a variable: keep those separate to reduce false sharing.
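
A sketch of that last point, padding per-thread counters to the line size (std::hardware_destructive_interference_size is C++17; on toolchains that don't provide it, plain alignas(64) is the usual stand-in):

    // Sketch: one counter per worker, each on its own cache line, so
    // updates from different cores don't invalidate each other's lines.
    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size

    struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter per_thread_counters[8];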

Remember you typically have just 512 L1 data cache lines to work with -- up to 1024 on some CPUs, while others might have only 256. Wrong data access patterns will continuously invalidate those.

Imagine processing an image stored in row-major format and reading pixels vertically. If the image is taller than 512 pixels and wider than 64 bytes, you're pretty much guaranteed continuous L1D cache misses. So swizzle [1], or at least divide the image into smaller blocks first, to improve the L1D cache hit rate.
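
As a sketch of the blocked approach, here's a transpose of a row-major 8-bit image done in 64x64 tiles (a hypothetical routine; the tile size would be tuned to the target's L1D):

    // Sketch: blocked transpose of a row-major 8-bit image. Naive
    // column-order access misses L1D on almost every pixel once the image
    // is tall enough; working in 64x64 tiles keeps the source and
    // destination lines resident while they're being reused.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    void transpose_blocked(const uint8_t* src, uint8_t* dst,
                           size_t width, size_t height) {
        constexpr size_t kTile = 64;
        for (size_t y0 = 0; y0 < height; y0 += kTile)
            for (size_t x0 = 0; x0 < width; x0 += kTile)
                for (size_t y = y0; y < std::min(y0 + kTile, height); ++y)
                    for (size_t x = x0; x < std::min(x0 + kTile, width); ++x)
                        dst[x * height + y] = src[y * width + x];
    }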

On top of that, using SIMD is pretty beneficial.

Small nitpick: pretty much all of the CPU cores I can think of (representing 99% of cases, I think) have 64-byte cache lines, not 16 as this article suggests. The rest have 32-byte cache lines. While they might exist, I don't know of any modern CPU with 16-byte cache lines.
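
If you don't want to hard-code 64, the line size can also be queried at runtime; a sketch for Linux/glibc (the sysconf name is platform-specific):

    // Sketch: query the L1 data cache line size (Linux/glibc).
    #include <cstdio>
    #include <unistd.h>

    int main() {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line <= 0)
            line = 64;  // common default on current x86 and ARM cores
        std::printf("L1D line size: %ld bytes\n", line);
    }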

[1]: One common swizzling curve/algorithm: https://en.wikipedia.org/wiki/Z-order_curve
