
He says because of cache misses due to each element being allocated individually.

But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.

Should we ditch Lisp and Java?

Note, data does not have to be contiguous to stay in cache. The cache works more like paged virtual memory: it consists of cache lines.

Before subscribing to the "never use linked lists" view I would like to see a nontrivial program rewritten from lists and references to the "allocate everything at once" approach and measure the performance.

If you want to avoid references and allocate everything in place, it may force your program to do a lot of copying.

Also, not only CPU performance matters - programmer performance is important too. So-called "high-level languages" try to optimise this.




The performance of vectors comes from iterating through them and letting the CPU prefetch items before you need them. Random access into a vector doesn’t really get you that benefit if the vector is larger than your L1/L2 caches; a linked list’s nodes would be in cache anyway if you had used them recently enough.

> Is the hope to avoid paying the cost of an extra pointer dereference when we're using the first map? Or does independently allocating each object hurt cache locality or something like that?

Both.

In practice, I probably wouldn’t use a hashmap for the first container that actually owns these items. When I do expect gigabytes of data, in C++ I use something like vector<vector<tValue>>, where the inner vectors are of the same fixed size (except for the last one), e.g. 2-16MB RAM / each. If I need to erase elements, I include a free list such as this one: https://github.com/Const-me/CollectionMicrobench/blob/master...

But the exact container is not that important here. If you don’t have that many values, it can as well be a standard vector.

The point is that C++ allows composing these containers into higher-level ones, such as this indexed-array example, using pointers to link individual items across them. This feature allows building sophisticated and efficient data structures that are still possible to reason about.
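
For illustration, here is a rough sketch of that chunked-storage idea (my own simplification with an arbitrary chunk size, not the code behind the link):

    #include <cstddef>
    #include <vector>

    // Chunked storage: a vector of fixed-size inner vectors. Growth only ever
    // appends a new chunk, so elements already stored are never moved and each
    // chunk stays one contiguous block of a few MB.
    template <typename T, std::size_t ChunkElements = (4u << 20) / sizeof(T)>
    class ChunkedVector {
        std::vector<std::vector<T>> chunks;
        std::size_t count = 0;
    public:
        T& push_back(const T& value) {
            if (chunks.empty() || chunks.back().size() == ChunkElements) {
                chunks.emplace_back();
                chunks.back().reserve(ChunkElements);
            }
            chunks.back().push_back(value);
            ++count;
            return chunks.back().back();
        }
        T& operator[](std::size_t i) {
            return chunks[i / ChunkElements][i % ChunkElements];
        }
        std::size_t size() const { return count; }
    };

Because stored elements never move, pointers to them stay valid as the container grows, which is what lets higher-level index structures like the ones mentioned above reference items safely.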


> pointer-jumping will throw away the (absolutely astonishingly large) benefits of the cache, unless you’re careful.

Right! Yes, that’s part of my point: std::list isn’t being careful with cache, or with allocations. Really, it just rarely makes sense to even compare std::list to std::vector as if they were otherwise equal choices. Usually the choice is (or should be) driven by constraints, not by which has a slight perf edge, right? Inside a memory manager, use of a vector isn’t usually considered a choice. Maybe it’s possible to build a free-page vector, but I don’t think it’s common, and people usually pay the costs of pointer chasing on the free list because there aren’t practical alternatives.

> the <vector> growth strategy does not free+malloc/reallocate on each insertion

Yeah, very good point. Does std::list do the same for its nodes? I don’t even know, but maybe it can’t, and maybe the primary perf advantage of std::vector over std::list is due to vector’s amortized mallocs?


That's a nice little demonstration of superior vector performance under a certain set of (very common) conditions:

# The value type (int in this case) is very small, in fact it's pointer-sized.

# This entire benchmark easily fits into L2 cache. The benchmark only goes up to 100000 elements, which for an int is only 390K. This can be an advantage of vector, but it's not necessarily such a slam-dunk in real-world programs where there are multiple data structures and other threads contending for the use of the cache.

# The value type has 'trivial' copy/move constructors and assignment operators. It's POD and outright memmove()-able.

# The benchmark code is only able to use list nodes one time. It doesn't take advantage of a list's ability to move items from one container to another without reallocation (see the splice sketch below).

# The benchmark is thoroughly randomized. If it were more typical real-world data (such as data supplied by another algorithm, or even by a malicious attacker), then you would want to consider the possibility of pathological access patterns arising.

So if you meet all of these conditions, definitely use a vector. If not, life is back to being complicated again.
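
On the fourth point, the list operation in question is std::list::splice, which moves nodes between lists in O(1) with no allocation or copying, something a vector cannot do. A minimal sketch (a toy example, not part of the benchmark):

    #include <iostream>
    #include <list>

    int main() {
        std::list<int> work = {1, 2, 3, 4, 5};
        std::list<int> done;

        // Move the front node of 'work' onto the end of 'done' in O(1):
        // no allocation, no copy/move of the element, and pointers/iterators
        // to the moved node remain valid.
        done.splice(done.end(), work, work.begin());

        for (int x : done) std::cout << x << ' ';   // prints: 1
        std::cout << '\n';
    }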


You are right and this is well known, but not only for the reason you state. Cache locality is so much better on an array/vector compared to a linked list, and that is really important.

That's absolutely right, but it's not just large objects. In any situation where you follow links rarely compared to your other memory accesses, the cache cost of the links becomes negligible and the benefit of simplicity and O(1) operations can shine. Vectors for scrubbing through things without changing their structure; lists and trees for the opposite.

Scrubbing through things quickly is empirically more common, hence the standard advice to use vectors by default.


> Embedded loves contiguous blocks.

Quite the opposite. When you have no dynamic allocations, vectors are out of the question, while a pool of nodes is fine and sound.

And quite often you can't copy elements from one place to another in embedded or operating systems, since many consumers may be storing a pointer to the element. Of course this could be solved by using a vector of pointers, but that completely ruins the "cache locality" argument.
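
For what it's worth, a minimal sketch of that approach: a statically sized pool with an intrusive free list and no dynamic allocation anywhere (names and the pool size are made up):

    #include <cstddef>

    struct Node {
        int   value;
        Node* next;   // doubles as the free-list link while the node is unused
    };

    // All nodes live in one statically allocated array: no malloc/new anywhere.
    constexpr std::size_t kPoolSize = 256;
    static Node  pool[kPoolSize];
    static Node* free_head = nullptr;

    void pool_init() {
        for (std::size_t i = 0; i + 1 < kPoolSize; ++i) pool[i].next = &pool[i + 1];
        pool[kPoolSize - 1].next = nullptr;
        free_head = &pool[0];
    }

    Node* pool_alloc() {              // O(1); returns nullptr when exhausted
        Node* n = free_head;
        if (n) free_head = n->next;
        return n;
    }

    void pool_free(Node* n) {         // O(1); the node's address never changes,
        n->next = free_head;          // so consumers can safely keep pointers to it
        free_head = n;
    }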


> the array is contiguous in memory - so operations like search or moving stuff around after addition make extremely good use of CPU's cache

Also - if I see someone try to use a linked list for an enormous data structure again.... Wow it does not scale worth crap because it turns out that the hardware is actually important, and contiguous memory is amazing.


You're misunderstanding the advice about contiguous memory. It's not that it's more likely to stay in cache; it's that if you're accessing data sequentially, the data you're going to access next is more likely to already be in cache.

Most (all I've read/looked at) benchmarks in Java show data structures backed by linked lists being utterly smashed by ones implemented with an array.

There was a good C++ talk in the last year or two where a game or game engine changed its storage to be more array-like and got roughly a 30% boost in performance. Memory locality is usually king, which is why linked lists are rarely used.


The slide deck is about performance pitfalls of OOP languages, which can surreptitiously introduce cache misses, random access to memory, and excessive levels of indirection into your code:

> Prefetching

> Data accesses are now predictable

I'm reminded of a Stroustrup talk in which he says that vectors will always outperform linked list structures, by taking advantage of spatial locality and predictable access patterns to enable better cache prefetching. He even has a slide on C++ vs "True OO" style object graphs, and it looks remarkably similar to what's described in the OP: https://www.youtube.com/watch?v=YQs6IC-vgmo&t=3m53s


> Insertion and deletion are prime examples, often requiring you to move large chunks of memory.

On modern CPUs with huge caches, I seriously doubt this claim.

FWIW, I have played around with synthetic problems where I insert a sequence of random integers into a sorted sequence, then remove those elements one by one as determined by a random sequence of positions. Almost always, vectors outperform lists by at least a couple of orders of magnitude.

Or to put it another way, it is almost never about lists vs. vectors vs. something else; it boils down to a couple of 'rules', e.g. access data predictably (avoid thrashing the cache), keep data compact (better cache utilization), etc.
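
A rough sketch of that kind of test (not the exact code behind those numbers; the list version pays a potential cache miss on every step of the search):

    #include <algorithm>
    #include <cstddef>
    #include <list>
    #include <random>
    #include <vector>

    // Insert n random ints into a sorted std::vector, keeping it sorted.
    // The vector pays for shifting elements on each insert, but the search
    // walks contiguous memory; the list version chases a pointer per step.
    void sorted_insert_vector(std::size_t n, std::mt19937& rng) {
        std::uniform_int_distribution<int> dist;
        std::vector<int> v;
        for (std::size_t i = 0; i < n; ++i) {
            int x = dist(rng);
            v.insert(std::lower_bound(v.begin(), v.end(), x), x);
        }
    }

    void sorted_insert_list(std::size_t n, std::mt19937& rng) {
        std::uniform_int_distribution<int> dist;
        std::list<int> l;
        for (std::size_t i = 0; i < n; ++i) {
            int x = dist(rng);
            auto it = l.begin();
            while (it != l.end() && *it < x) ++it;   // one pointer chase per step
            l.insert(it, x);
        }
    }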


In game dev, you sometimes avoid holding on to pointers for more than a single frame or operation because of this. Especially since many things get created or destroyed over time, and we want to keep our arrays densely packed to reduce cache misses.

We are trying a database-like approach for retrieving references to things in our current project. It adds a small performance penalty, but it's far from a bottleneck, and it's been nice to work with.
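
One common shape for that kind of lookup is generation-checked handles instead of raw pointers. A minimal sketch with made-up names (not necessarily what this project does):

    #include <cstdint>
    #include <vector>

    struct Handle { std::uint32_t index; std::uint32_t generation; };

    // Hand out small handles instead of raw pointers; stale handles are
    // detected via a generation counter instead of dangling.
    template <typename T>
    class Registry {
        struct Slot { T value; std::uint32_t generation; bool alive; };
        std::vector<Slot> slots;
    public:
        Handle create(const T& v) {
            // A real version would reuse dead slots through a free list.
            slots.push_back({v, 0, true});
            return {static_cast<std::uint32_t>(slots.size() - 1), 0};
        }
        void destroy(Handle h) {
            slots[h.index].alive = false;
            ++slots[h.index].generation;           // invalidates outstanding handles
        }
        T* get(Handle h) {                         // nullptr for stale handles
            Slot& s = slots[h.index];
            return (s.alive && s.generation == h.generation) ? &s.value : nullptr;
        }
    };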


> most of these data structures end up sharing a lot of data, and a “new” structure is often only the diffs from the previous version.

Like everything else in life, those data structures come with tradeoffs. Accessing an array element is an indexing operation into a contiguous block of memory. The indexing is built into the instruction set as an addressing mode, the array's memory is contiguous (so better locality of reference for caching), and if you're traversing it predictably, prefetch hardware can bring it into cache before you issue the instructions that do the read.

The same operation in a persistent vector is a function call, several pointer dereferences, and a more complex in-memory structure with more overhead. It works elegantly, but it's a completely different (and likely slower) level of abstraction from a simple in-memory vector.

A few years ago, I switched a toy search program away from ImmutableJS and over to naive copying of raw JS data structures. I don't remember the exact performance delta, but it was something like a 10-100x speedup, for code that was otherwise structurally identical:

https://github.com/mschaef/react-matchstick/commit/070802b69...

In this case, I had a goal to achieve, a timeline on which to achieve it, and it was the ability to switch to simpler, closer-to-the-metal data structures that made it happen. Had I needed more optimization, the most logical approach would have been to remove even more copying and rely even more heavily on mutation. This shouldn't come as a surprise, because at their core, these machines are fundamentally built on mutation, and there's a cost to pretending otherwise.

Now, in fairness to the persistent structures, this is something like a worst case scenario for their use - the algorithm is almost entirely dominated by manipulating relatively small data structures without much opportunity for sharing. It was also single threaded, small, and developed by a single developer in a small amount of time, so the organizational benefits of immutability were not apparent either.

If this was a huge redux state shared across a team of 100 developers and updated only a few times a second, I could probably tell a completely different story. It's easy to imagine (based on firsthand experience!) the kind of havoc that can be inflicted on a project with an erroneous update.

I get the enthusiasm for functional programming and persistent structures, etc., but at the end of the day it's just an engineering approach. One of many, and there's room to do the work to choose between them.


> A problem there is that in memory the arrays are of course laid out one after the other, which actually destroys cache locality if you need to access more than 1 property inside a loop (it will need to load back and forth to the different property arrays), so it's a somewhat dumb approach.

Caches are perfectly capable of dealing with more than one stream of data (there are some very specific edge cases you may have to consider); accessing multiple arrays linearly in a loop is generally more efficient than accessing a single array of structs when you don't use almost all of the struct's fields.
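
To make the comparison concrete, a trivial sketch of the two layouts: with struct-of-arrays, a loop that touches only x and y streams two dense arrays through the cache instead of dragging the unused fields along.

    #include <cstddef>
    #include <vector>

    // Array-of-structs: every iteration drags the unused payload through the cache.
    struct Particle { float x, y; float payload[14]; };   // 64 bytes per element

    float sum_aos(const std::vector<Particle>& p) {
        float s = 0;
        for (const Particle& q : p) s += q.x + q.y;
        return s;
    }

    // Struct-of-arrays: the same loop reads two dense streams; hardware
    // prefetchers track a handful of streams like this without trouble.
    struct Particles { std::vector<float> x, y, payload; };

    float sum_soa(const Particles& p) {
        float s = 0;
        for (std::size_t i = 0; i < p.x.size(); ++i) s += p.x[i] + p.y[i];
        return s;
    }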


I’ve considered making a vector sorta thing implemented as a linked list of arrays. Assuming you sized the arrays to fit well inside the CPU cache, are there really many tradeoffs?
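
For what it's worth, a bare-bones sketch of that shape, essentially an unrolled linked list (std::deque is a close relative), assuming a node size tuned to fit in cache and a default-constructible element type:

    #include <cstddef>

    // Unrolled linked list: each node holds a small fixed-size array, so
    // iteration is mostly sequential with an occasional pointer hop between
    // nodes. N would be tuned so one node sits comfortably in cache.
    template <typename T, std::size_t N = 64>
    struct UnrolledList {
        struct Node {
            T           items[N];
            std::size_t used = 0;
            Node*       next = nullptr;
        };
        Node* head = nullptr;
        Node* tail = nullptr;

        void push_back(const T& v) {
            if (!tail || tail->used == N) {
                Node* n = new Node();
                if (tail) tail->next = n; else head = n;
                tail = n;
            }
            tail->items[tail->used++] = v;
        }

        ~UnrolledList() {
            for (Node* n = head; n;) { Node* nx = n->next; delete n; n = nx; }
        }
    };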

> On the other hand, pointer chasing often isn't as bad as you think it might be.

It kinda really is, though. In addition to cache-line locality, serial access also benefits from being speculatable. The CPU can't very effectively speculate past a pointer chase (and on ARM little cores it doesn't even try), so those become pipeline stalls.

Even if the pointer happened to be close-ish, it's still going to end up being a stall more often than not.

And an allocator / GC is only going to lay out a linked list in any sort of predictable way if the linked list is built up all at once, in which case a linked list is obviously not the right data structure anyway ;)


Well, I think you'd agree that if random access is a requirement, arrays are nearly always going to work better.

Even then, that might be a good strategy if inserting into a vector-type data structure required a realloc every time. However, most implementations use an exponential growth strategy, so for a million-item vector grown one element at a time, you'd see only 10-20 reallocs. Which is to say, the realloc problem is only a problem in memory-constrained -- think embedded or kernel -- systems where a memory overhead of 1.5-2x isn't viable.
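
That 10-20 figure is easy to sanity-check by watching capacity changes while growing a vector one element at a time (a quick sketch; the exact count depends on the implementation's growth factor):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t reallocs = 0;
        std::size_t capacity = v.capacity();
        for (std::size_t i = 0; i < 1000000; ++i) {
            v.push_back(static_cast<int>(i));
            if (v.capacity() != capacity) {    // capacity change == new allocation + move
                capacity = v.capacity();
                ++reallocs;
            }
        }
        // Prints a number in the low twenties with a 2x growth factor
        // (somewhat more with the 1.5x factor some implementations use).
        std::cout << "reallocations: " << reallocs << '\n';
    }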

You can do the same thing -- and some linked list implementations do -- with linked lists, where you grab a block of nodes at a time. If you don't, though, the vector will be faster on inserts alone. Mallocs are non-trivial in most applications and can easily become the bottleneck.

I think the thing you might be underestimating is how much faster it is to iterate over a vector than a linked list, never mind random access. Cache issues really, really do matter. If you want to see this in action and happen to have a SoC dev board around, disable the data cache and run some benchmarks with vectors. If you re-enable it, you will easily see a 10x-20x improvement.

If you watch https://www.youtube.com/watch?v=YQs6IC-vgmo, he goes into some of the use cases and some benchmarks that might be compelling to you.


The question had 2 goals:

1) Do you think about cache at all or is it just something you heard mentioned as important that one time?

2) It's a good lead-in to discussing the effects of cache in algorithms. How that conversation goes helps me to understand how that person thinks and discusses complex problems.

A good answer would be "I'm not sure, but probably way, way slower because linked list can point all over memory but arrays cache really well."

An excellent, A+ answer would be "In the best case it might not be too much slower, if you have an intrusive linked list arranged sequentially in memory like an array, as onekorg explained. But in practice most will be 20-200x slower, because they are usually implemented as pointers to nodes containing pointers to data, and each piece is allocated piecemeal in an already fragmented heap. Uncached memory reads can take 100+ cycles, and summing an int is not enough work to hide that even if the CPU speculatively prefetches."

I mainly expected a surprised reaction that they could be so slow, and looked forward to the follow-up discussion.


Even O(1) operations on lists can break cache locality. Since you're chaining using pointers, all bets that your cells are nearby are off. With vectors, they always are.
