Even O(1) operations on lists can break cache locality. Since you're chaining through pointers, all bets that your cells are anywhere near each other are off. With vectors, the elements are contiguous by definition.
You are right, and this is well known, but not only for the reason you state. Cache locality is so much better on an array/vector compared to a linked list, and that really matters.
Wait, why do linked lists have bad cache locality? It depends on how your heap is set up. For example, you could have an allocator that gives relatively good locality by having high granularity and only bailing out if, say, your LL gets super long (so if your lists are usually short, they could have awesome locality).
You use them all the time in Haskell and OCaml. Cache locality isn't such an issue. You're not mallocing linked list nodes. You allocate by a pointer bump of the minor heap, and if the GC copies your list into the major heap, it's going to copy the list elements together.
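For anyone who hasn't seen it, here's a minimal C++ sketch of that pointer-bump style of allocation (the Arena type and sizes are made up for illustration; in Haskell/OCaml the runtime's minor heap does this for you):

    #include <cassert>
    #include <cstddef>
    #include <new>
    #include <vector>

    struct Node {
        int   value;
        Node* next;
    };

    // Illustrative arena: allocation is just a pointer bump, so nodes
    // cons'd one after another land on adjacent addresses, roughly like
    // consecutive allocations in a GC'd runtime's minor heap.
    class Arena {
        std::vector<std::byte> buf;
        std::size_t            used = 0;
    public:
        explicit Arena(std::size_t bytes) : buf(bytes) {}
        Node* alloc(int value, Node* next) {
            assert(used + sizeof(Node) <= buf.size());
            Node* n = new (buf.data() + used) Node{value, next};
            used += sizeof(Node);
            return n;
        }
    };

    int main() {
        Arena arena(1 << 20);
        Node* list = nullptr;
        for (int i = 0; i < 1000; ++i)
            list = arena.alloc(i, list);  // consecutive nodes end up contiguous
    }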
You also use recursion all the time, and no, recursion is not generally straightforwardly optimised into iteration unless you're doing trivial tail recursion.
It's not just cache locality; it's also the effectiveness of cache prefetching and the ability to use wider operations. Simple array loops are something compilers can often vectorize without assistance.
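A quick illustrative contrast (with gcc/clang at -O2/-O3 the array loop typically auto-vectorizes; the list loop can't, since every load depends on the previous node):

    #include <cstddef>

    // Contiguous data, known stride, no pointer chasing: vectorizable.
    long long sum_array(const int* a, std::size_t n) {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    struct Node { int value; Node* next; };

    // Each iteration's load address comes from the previous node, so the
    // compiler can't vectorize and the hardware can't prefetch far ahead.
    long long sum_list(const Node* head) {
        long long s = 0;
        for (const Node* p = head; p; p = p->next)
            s += p->value;
        return s;
    }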
I'm a bit worried about the amount of pointer chasing here. Presumably on modern architectures that would kill cache locality? Or am I missing something?
He says because of cache misses due to each element being allocated individually.
But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.
Should we ditch Lisp and Java?
Note: data does not need to be contiguous to stay in cache. Cache works more like paged virtual memory; it consists of cache lines.
Before subscribing to the "never use linked lists" view, I would like to see a nontrivial program rewritten from lists and references to the "allocate everything at once" approach, with the performance measured.
If you want to avoid references and allocate everything in place, it may force your program to do a lot of copying.
Also, CPU performance isn't the only thing that matters; programmer performance is important too. So-called "high-level languages" try to optimise for this.
Even this doesn't give you optimal locality. Objects in the same pool are closer together than if they were randomly fragmented by a global allocator, but if you're accessing them in list order, you're not accessing adjacent addresses, so you don't get the benefit of memory prefetch. If the pool is large, it may not even fit in the cache.
When performance needs to be maximized, games switch to entity component systems and switch from arrays-of-structs to structs-of-arrays. This enables processing all objects as a vector, linearly from start to end, and often without needing to fetch any irrelevant bytes that aren't processed in a given pass. This sometimes also helps utilize SIMD for data spanning more than one entity, which you can't do when using linked lists.
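A minimal sketch of that AoS-to-SoA switch (the Particle fields and the update pass are made up for illustration):

    #include <cstddef>
    #include <vector>

    // Array-of-structs: fields are interleaved, so a pass that only
    // updates positions still drags velocity bytes through the cache.
    struct Particle {
        float x, y, z;
        float vx, vy, vz;
    };

    // Struct-of-arrays: each field is its own contiguous array, so a
    // position pass touches only the bytes it needs, and the compiler
    // can apply SIMD across particles.
    struct Particles {
        std::vector<float> x, y, z;
        std::vector<float> vx, vy, vz;
    };

    void integrate(Particles& p, float dt) {
        for (std::size_t i = 0; i < p.x.size(); ++i) {
            p.x[i] += p.vx[i] * dt;  // linear, prefetch-friendly
            p.y[i] += p.vy[i] * dt;
            p.z[i] += p.vz[i] * dt;
        }
    }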
The performance of vectors comes from iterating through them and letting the CPU prefetch items before you need them. Random access into a vector doesn’t really get you that benefit if the vector is larger than your L1/L2 caches; then again, a linked list would already be sitting in those caches anyway if you’d used it recently enough.
Even with perfect cache locality, a linked list will be significantly slower to traverse than an array because of worse pipelining and parallelization. Partially unrolled lists can help, but at the cost of additional complexity.
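A sketch of a partially unrolled node, for the curious (the chunk size is illustrative; you'd tune it to your cache-line and element sizes):

    #include <cstddef>

    // Each node carries a small array, so one pointer dereference yields
    // several elements from one or two cache lines instead of one
    // element per miss. The inner loop is a plain array scan.
    constexpr std::size_t kChunk = 14;

    struct UnrolledNode {
        std::size_t   count = 0;
        int           items[kChunk];
        UnrolledNode* next = nullptr;
    };

    long long sum(const UnrolledNode* head) {
        long long s = 0;
        for (const UnrolledNode* p = head; p; p = p->next)
            for (std::size_t i = 0; i < p->count; ++i)
                s += p->items[i];
        return s;
    }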
Of course not all container accesses require traversal.
Wouldn’t structs of arrays affect cache locality adversely if, say, you’re reading and processing the data in sequential order? You’d have to jump between memory addresses that are pretty far apart to read a single record / set of fields, and potentially trigger cache misses, since they’re likely spread out across different memory pages and you end up filling your L1/L2/L3 caches?
> pointer-jumping will throw away the (absolutely astonishingly large) benefits of the cache, unless you’re careful.
Right! Yes, that’s part of my point: STL::list isn’t being careful with cache, or with allocations. Really, it rarely makes sense to even compare STL::list to STL::vector as if they’re otherwise equal choices. Usually the choice is (or should be) driven by constraints, not by which has a slight perf edge, right? Inside a memory manager, use of a vector isn’t usually considered a choice. Maybe it’s possible to build a free-page vector, but I think that isn’t common, and people usually pay the cost of pointer chasing on the free list because there aren’t practical alternatives.
> the <vector> growth strategy does not free+malloc/reallocate on each insertion
Yeah very good point. Does STL::list do the same for the container of pointers? I don’t even know, but maybe it can’t, and maybe the primary perf advantage of STL::vector over STL::list is due to vector’s amortized mallocs?
Nitpick^2: You are probably saving a few cycles of loop overhead around pointer offsets and bounds checking, plus not having to load the items into cache twice if the sequence is too long to fit in cache.
Problem is memory locality, or lack thereof. Having to chase pointers around and not having things laid out in sequence doesn't play well with the CPU cache.
It will depend on the allocation method. For example, you could use a pooled allocator that allocates a contiguous array of list items in a block, then uses those in order as required. That would help with cache locality. IIRC, the Borland C++ STL implementation did something like this.
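Something like this minimal sketch (not the Borland implementation, just the general shape; the block size and names are made up):

    #include <cstddef>
    #include <vector>

    struct Node {
        int   value;
        Node* next;
    };

    // Nodes are handed out in order from contiguous fixed-size blocks,
    // so a list built by successive appends lands on mostly-adjacent
    // addresses. Freed nodes are recycled through a free list.
    class NodePool {
        static constexpr std::size_t kBlockSize = 256;  // nodes per block
        std::vector<std::vector<Node>> blocks;          // inner vectors never resize
        std::size_t next_in_block = kBlockSize;
        Node* free_list = nullptr;
    public:
        Node* alloc(int value, Node* next) {
            Node* n;
            if (free_list) {                        // reuse a freed node first
                n = free_list;
                free_list = free_list->next;
            } else {
                if (next_in_block == kBlockSize) {  // start a fresh block
                    blocks.emplace_back(kBlockSize);
                    next_in_block = 0;
                }
                n = &blocks.back()[next_in_block++];
            }
            *n = Node{value, next};
            return n;
        }
        void release(Node* n) { n->next = free_list; free_list = n; }
    };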
Don't forget that good cache locality can also cause data to be pulled into cache that the prefetcher knew nothing about.
I can create you a shitty linked list that fits perfectly in L3, but still has terrible cold cache performance because each individual cacheline has to be pulled in one by one.
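In code, something like this sketch (sizes are illustrative): ~64K nodes at one per 64-byte line is about 4 MiB, which fits a typical L3, yet a cold traversal still costs roughly one miss per node, because the link order is shuffled and no two consecutive nodes share a line.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    struct alignas(64) Node {  // exactly one node per cache line
        int   value;
        Node* next;
    };

    // Build a list whose nodes live in one small, cache-sized block but
    // whose link order is shuffled: great "locality" on paper, terrible
    // cold-cache traversal in practice. Assumes nodes is non-empty.
    Node* build_scattered_list(std::vector<Node>& nodes) {
        std::vector<std::size_t> order(nodes.size());
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        for (std::size_t i = 0; i + 1 < order.size(); ++i)
            nodes[order[i]].next = &nodes[order[i + 1]];  // links jump around
        nodes[order.back()].next = nullptr;
        return &nodes[order.front()];
    }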