I wonder if it would be worth creating a hybrid data structure: a linked list of arrays. Say you tune it so each array fits in the L1 cache. You don't need a huge chunk of contiguous memory, but you still get most of the performance benefits of arrays. Iteration gets a lot more complicated, of course.
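
Something like this, maybe (a minimal C++ sketch of the idea; the chunk size of 64 and all names are made up for illustration, not taken from any particular library):

  // One node of the hybrid: a small contiguous chunk plus a link.
  // 64 ints = 256 bytes, i.e. a handful of cache lines; tune to taste.
  struct Chunk {
      int    data[64];
      int    count = 0;          // slots actually used in this chunk
      Chunk* next  = nullptr;
  };

  // Iteration becomes a nested loop: over chunks, then over elements.
  long long sum(const Chunk* head) {
      long long total = 0;
      for (const Chunk* c = head; c != nullptr; c = c->next)
          for (int i = 0; i < c->count; ++i)
              total += c->data[i];
      return total;
  }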



I’ve considered making a vector sorta thing implemented as a linked list of arrays. Assuming you sized the arrays to fit well inside the CPU cache, are there really many tradeoffs?

Maybe there's a hybrid approach. For instance, I can imagine a rope-like data structure where the nodes are sized to a cache row for example. It might not be optimal from a memory usage standpoint, but there are probably use cases where it could make sense

I can 'efficiently modify the lists you started with' with a dynamic multi-type array, and I get cache locality and random access. I have a single data structure with an 8-byte overhead, plus 2 bytes of overhead per node, that can be used as a tuple array with direct lookup, a singly linked list, a doubly linked list, a queue, a stack, and a balanced binary search tree. The whole structure is allocated in a single contiguous memory chunk and is cache friendly. It can be serialized and deserialized without any processing, unlike a Lisp-style linked list, which would need fix-ups even if it were just moved in memory, saved/restored to/from disk, or transmitted to another computer.
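
I can only gesture at the exact layout, but the general trick of linking nodes by small integer indices inside one contiguous buffer looks roughly like this (hedged C++ sketch; the field names and the 16-bit index width are illustrative, not the actual structure described above):

  #include <cstdint>
  #include <vector>

  // Nodes live in one contiguous vector and refer to each other by
  // 16-bit indices instead of pointers, so the whole thing can be
  // memcpy'd, saved to disk, or moved without any pointer fix-ups.
  struct Node {
      std::uint16_t next;   // index of the next node; 0xFFFF = none
      std::int32_t  value;  // payload; a real version might hold a tuple
  };

  struct IndexList {
      std::vector<Node> nodes;      // single contiguous allocation
      std::uint16_t     head = 0xFFFF;

      std::uint16_t push_front(std::int32_t v) {
          nodes.push_back({head, v});
          head = static_cast<std::uint16_t>(nodes.size() - 1);
          return head;
      }
  };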

Lisp DOES predate level 1 cache, but we DO have level 1 cache now, and its effects dwarf all other optimizations on modern computers. Wouldn't it be nice if the hardware were designed to optimize for our languages instead of the other way around? But I live in this universe rather than an alternate one.

What matters is now, not decades ago.


Cache effects mean inserts into arrays are often faster than into linked lists. Remember, RAM is really, really big and slow.

You're misunderstanding the advice about contiguous memory. It's not that it's more likely to stay in cache; it's that if you're accessing data sequentially, the data you're going to access next is more likely to already be in cache.

Most (all I've read or looked at) benchmarks in Java show data structures backed by linked lists utterly smashed by ones backed by an array.

In the last year or two there was a good C++ talk where a game (or game engine) changed its storage to be more array-like and got roughly a 30% performance boost. Memory locality is usually king, which is why linked lists are rarely used.


If you really need to, you can have it both ways: roll a data structure that is a linked list of arrays. Then you get constant-time growth and good caching/prefetching.

Toy example:

  template <typename T>
  struct ListSegment {
    T items[64];                       // contiguous chunk of elements
    int nextItem = 0;                  // slots used in this segment
    ListSegment* nextSeg = nullptr;
    ListSegment* prevSeg = nullptr;
  };
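
Appending stays cheap with this layout: fill the current segment and only allocate a new one when it is full (hedged sketch building on the toy struct above; the helper name and fixed chunk size of 64 are just for illustration):

  template <typename T>
  void append(ListSegment<T>*& tail, const T& value) {
      if (tail->nextItem == 64) {            // current segment is full
          auto* seg = new ListSegment<T>();  // constant-time growth: no realloc, no copy
          seg->prevSeg = tail;
          tail->nextSeg = seg;
          tail = seg;
      }
      tail->items[tail->nextItem++] = value;
  }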

The question had 2 goals:

1) Do you think about cache at all or is it just something you heard mentioned as important that one time?

2) It's a good lead-in to discussing the effects of cache in algorithms. How that conversation goes helps me to understand how that person thinks and discusses complex problems.

A good answer would be, "I'm not sure, but probably way, way slower, because a linked list can point all over memory but arrays cache really well."

An excellent, A+ answer would be, "In the best case it might not be too much slower, if the intrusive linked list is arranged sequentially in memory like an array, as onekorg explained. But in practice most will be 20-200x slower, because they are usually implemented as pointers to nodes containing pointers to data, and each piece is allocated piecemeal in an already fragmented heap. Uncached memory reads can take 100+ cycles, and summing an int is not enough work to hide that, even if the CPU speculatively prefetches."
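
To make the follow-up discussion concrete, here is a rough way to see it yourself (hedged C++ sketch, not a rigorous benchmark; the 20-200x range above depends heavily on the machine, and the shuffle is only there to defeat the allocator's accidental locality):

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <numeric>
  #include <random>
  #include <vector>

  struct Node { int value; Node* next; };

  int main() {
      const int n = 1000000;
      std::vector<int> arr(n, 1);                 // contiguous: streams through cache

      // Build a list whose nodes are linked in shuffled order,
      // roughly imitating nodes scattered across a fragmented heap.
      std::vector<Node> storage(n);
      std::vector<int> order(n);
      std::iota(order.begin(), order.end(), 0);
      std::shuffle(order.begin(), order.end(), std::mt19937{42});
      Node* head = &storage[order[0]];
      for (int i = 0; i < n; ++i) {
          storage[order[i]].value = 1;
          storage[order[i]].next  = (i + 1 < n) ? &storage[order[i + 1]] : nullptr;
      }

      auto time_it = [](const char* label, auto&& f) {
          auto t0 = std::chrono::steady_clock::now();
          long long s = f();
          auto t1 = std::chrono::steady_clock::now();
          long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
          std::printf("%s: sum=%lld, %lld us\n", label, s, us);
      };

      time_it("array", [&] { long long s = 0; for (int v : arr) s += v; return s; });
      time_it("list ", [&] { long long s = 0; for (Node* p = head; p; p = p->next) s += p->value; return s; });
  }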

I mainly expect a surprised reaction that they could be so slow, and I look forward to the follow-up discussion.


Using array-based data structures also gets you more cache hits. For smaller data sets, this may well be faster than a fancy algorithm.

Yes, in the end I had to go to a custom hashmap/linked list combination to get the best performance for our specific use case (basically a flow table with LRU cache behavior and also expiration of flows by time limits).
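
Presumably something in the spirit of the classic LRU pattern: a hash map for O(1) key lookup pointing into a recency-ordered list for O(1) eviction. A hedged C++ sketch (std::unordered_map/std::list stand in for the custom structures, the flow key is reduced to a single integer, and the time-based expiration is omitted):

  #include <cstddef>
  #include <cstdint>
  #include <list>
  #include <unordered_map>

  // Minimal LRU flow table: hash lookup into a recency-ordered list.
  struct FlowStats { std::uint64_t bytes = 0; };

  class LruFlowTable {
      using Entry = std::pair<std::uint64_t, FlowStats>;   // key + stats
      std::list<Entry> lru_;                               // front = most recently used
      std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
      std::size_t capacity_;
  public:
      explicit LruFlowTable(std::size_t cap) : capacity_(cap) {}

      // Look up (or create) a flow and mark it as most recently used.
      FlowStats& touch(std::uint64_t key) {
          auto it = index_.find(key);
          if (it != index_.end()) {
              lru_.splice(lru_.begin(), lru_, it->second);   // move to front, O(1)
              return it->second->second;
          }
          if (lru_.size() >= capacity_) {                    // evict least recently used
              index_.erase(lru_.back().first);
              lru_.pop_back();
          }
          lru_.emplace_front(key, FlowStats{});
          index_[key] = lru_.begin();
          return lru_.front().second;
      }
  };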

I think it's also worth noting that this is specific to (common) CPU architectures where the cache is sufficiently big and orders of magnitude faster than main memory, and where main memory reads and writes have similar cost.

The tradeoff between moving memory at the end of the array and scanning to the point of linked-list insertion could be different on some existing and some hypothetical systems where these concepts are relevant. For example, minimizing storage write cycles could be a priority in rare cases.

That said, this is an excellent point about the majority of use cases. I hadn't considered the dominance of the time it takes to locate the element to modify. I've never had a reason to benchmark a linked list, since even a basic idea of the effect of cache/prefetch precludes its use in every use case I've actually come across.
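
That dominance is easy to see even with the standard containers (hedged sketch; both inserts are O(n), but the vector's O(n) is a contiguous shift the prefetcher loves, while the list's O(n) is a pointer chase just to find the spot before its "cheap" O(1) splice):

  #include <cstddef>
  #include <iterator>
  #include <list>
  #include <vector>

  // Insert at position k into each container.
  void insert_vec(std::vector<int>& v, std::size_t k, int x) {
      v.insert(v.begin() + static_cast<std::ptrdiff_t>(k), x);  // shift a contiguous block
  }

  void insert_list(std::list<int>& l, std::size_t k, int x) {
      auto it = l.begin();
      std::advance(it, k);     // O(k) pointer chasing, roughly one cache miss per node
      l.insert(it, x);         // the actual link-in is O(1), but we already paid
  }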

Probably useful for CS education to start comparing linked lists to bubble sort. There's always a better algorithm, but here's a simple one to learn so you understand how to compare data structure performance.


> the array is contiguous in memory - so operations like search or moving stuff around after addition make extremely good use of CPU's cache

Also - if I see someone try to use a linked list for an enormous data structure again.... Wow it does not scale worth crap because it turns out that the hardware is actually important, and contiguous memory is amazing.


At a guess, some understanding of this article[1]: if a CPU instruction, scaled up to human time, takes 1 second, then a Level 1 cache lookup takes 2 seconds and a main memory lookup takes 4 minutes.

Imagine that for the array it's 1 CPU instruction to load a value, 1 to load the next value, 1 to add them, and 1 to store the result; that would be 4 instructions per sum. Ideally the array would stream into the CPU after a single main-memory lookup delay up front, and then it's 4 instructions per pair, summed as fast as the CPU can loop.

The linked list at worst needs an imaginary 1 CPU instruction to load a value, 1 to load the pointer value, 1 to dereference the pointer, a delay of 2 seconds to get that value from L1 cache - missed, it's not in cache - 240 seconds stalled waiting for main memory, 1 to add, and 1 to store the result. Worst case, >240x slower.

The linked list is not guaranteed to be in contiguous memory, but it might be, so the cache might have the right data in it. The linked list is 50% data, 50% metadata, so the cache is half wasted / can hold half as much data, and if the linked list is coming in from a big memory read quickly, half the bandwidth is carrying pointer addresses not data, so the throughput is halved for that, too, and the processor cycles were already able to happen much faster than the main memory bus max speed. If it's not contiguous memory, you don't know in advance where it is to request all the right memory areas at once - not until you read sequentially to the last item pointer and find there are no more.

Maybe if they are both small, both in contiguous memory, and both go into Level 1 cache after a single main memory delay, it could be only ~2x the time, but the more data there is overall, the more chance the linked list will bust the cache or be discontiguous in memory. And on the plain array side, it might be possible to speed things up with SIMD/SSE instructions to spend fewer cycles adding and storing per element, which the linked list approach might not be amenable to at all[2]; then the best case might be ~4x slower, worst case ~500x slower.

[1] https://www.prowesscorp.com/computer-latency-at-a-human-scal...

[2] https://stackoverflow.com/questions/10930595/sse-instruction...


Especially since the array will probably be linear in memory, so it will work well with the cache, while the tree will probably have jumps all over RAM unless the implementation uses a backing array.

Not quite. Many other data structures can be shoehorned into contiguous, cache-efficient representations.

That's a really hard thing for a compiler to predict, though. I recently got a 2x performance boost just by switching to an array-of-structs-of-arrays representation (making the cache lines carry more data that actually matters), and the only reason it made a difference was the access pattern of a mildly non-trivial algorithm acting on that data (not hard per se, but definitely hard for a compiler to optimize through).
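
Roughly the kind of layout change being described (a hedged sketch, not the actual code; the particle fields are invented): when an algorithm only touches one field, a struct-of-arrays style layout makes every fetched cache line pure payload, whereas the array-of-structs layout drags the unused fields along.

  #include <vector>

  // Array of structs: iterating just x still pulls y, z and flags into cache.
  struct ParticleAoS { float x, y, z; int flags; };

  float sum_x_aos(const std::vector<ParticleAoS>& ps) {
      float s = 0;
      for (const auto& p : ps) s += p.x;   // 4 useful bytes out of every 16 fetched
      return s;
  }

  // Struct of arrays: the x values are densely packed.
  struct ParticlesSoA { std::vector<float> x, y, z; std::vector<int> flags; };

  float sum_x_soa(const ParticlesSoA& ps) {
      float s = 0;
      for (float v : ps.x) s += v;         // every fetched byte is an x
      return s;
  }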

Don't forget that good cache locality can also pull data into cache that the prefetcher knew nothing about.

I can create you a shitty linked list that fits perfectly in L3, but still has terrible cold cache performance because each individual cacheline has to be pulled in one by one.


Any game engine architect worth her salt would know to not speak so absolutely about cache coherency, and that if you're dealing with a use-case where iteration is massively infrequent but random insertions and removals are likely, you could be better off with the linked list :)

Even with perfect cache locality, a linked list will be significantly slower to traverse than an array because of worse pipelining and parallelization. Partially unrolled lists can help, but at the cost of additional complexity.

Of course not all container accesses require traversal.


This is entirely hypothesizing, but I don't see how these wide trees are awful for the cache. In particular, I don't think they would be much worse than a flat array of objects; the problem, I think, is that objects are references, and iterating the things in an array of pointers usually sucks.

For a HAMT with depth (e.g.) 2, you should be able to prefetch the next node in your iteration while you iterate the current node of 64 elements. Actually iterating through the objects is going to suck, but hopefully you can prefetch enough of them too. (Or maybe HotSpot could inline things enough to improve memory access patterns a bit and take advantage of ILP.) There's still the problem that you can't read main memory sequentially, except that often all the objects in a HAMT were allocated sequentially, so you can read main memory sequentially after all (since allocation is typically just bumping a pointer, allocation-time locality tends to correspond to address locality).
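
The prefetch idea from the first paragraph, sketched in C++ for concreteness (a real HAMT node carries a bitmap and variable-width slot arrays, and __builtin_prefetch is a GCC/Clang extension; on the JVM you'd be relying on the hardware prefetcher and HotSpot instead):

  // Toy wide-tree node: 64 slots per level, like the HAMT described above.
  struct Node {
      bool  leaf;
      void* slots[64];   // child Node* on inner nodes, element pointers on leaves
  };

  template <typename F>
  void for_each(const Node* n, F&& f) {
      for (int i = 0; i < 64; ++i) {
          if (!n->slots[i]) continue;
          // Hint the next occupied slot into cache while we work on this one.
          if (i + 1 < 64 && n->slots[i + 1])
              __builtin_prefetch(n->slots[i + 1]);
          if (n->leaf)
              f(n->slots[i]);
          else
              for_each(static_cast<const Node*>(n->slots[i]), f);
      }
  }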
