It's nice to be able to have productive discussion on the internet!
We can agree on almost all points, I think.
The CS 101 bit you mention is a good point: intrusiveness matters for asymptotics, and it should be taught in academia, not left to the (very small portions of) industry that use it.
By the way, when you mention "a vector whose elements are linked list heads", a much more typical scenario is an array/vector whose elements contain one or more list heads.
You were wondering about the kinds of software I was working on that needs this stuff: high-performance storage controllers. We need to maintain a lot of concurrency, handle a lot of kinds of failures, etc. We often require the same objects (e.g., an ongoing I/O request) to be looked up in various ways, associated with failure domains, timed out, etc. So we want each object organized by many different data structures, and we need O(1) deletion from all of those structures when an I/O request dies.
We also shun dynamic allocations, aside from relatively rare memory pools for large structures that contain the various allocations we need. The intrusive style buys us many nice things within these constraints (a minimal sketch follows below):
* Avoiding the runtime costs of dynamic allocations
* Avoiding the memory costs of extra indirections incurred by dynamic allocations and STL (non-intrusive) style
* Avoiding handling out-of-memory errors in every single code path: there are almost no allocations anywhere, almost all functions become "void" error-free functions!
* Having optimal asymptotics for our operations
* Having reusable generic data structures in C without templates (we dislike C++)
Compared to all this, the STL stuff is just horrible. STL is widely considered to be a superb library, but I find it horrid.
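To make the multiple-structures point concrete, here is a minimal sketch of the pattern (illustrative names, not our actual code; C-compatible, and it assumes circular lists with sentinel heads so prev/next are never null):

    struct list_node {
        struct list_node *prev, *next;
    };

    /* O(1) unlink from whatever list the node is on; no allocator involved. */
    static void list_unlink(struct list_node *n) {
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->prev = n->next = n;   /* self-linked means "not on any list" */
    }

    /* One object, a member of several structures at once. */
    struct io_request {
        struct list_node by_timeout;  /* ordered by deadline */
        struct list_node by_device;   /* grouped by failure domain */
        struct list_node by_bucket;   /* hash-table chain */
        /* ... payload ... */
    };

    /* When a request dies, unlink it from everything: O(1) per structure. */
    static void io_request_retire(struct io_request *req) {
        list_unlink(&req->by_timeout);
        list_unlink(&req->by_device);
        list_unlink(&req->by_bucket);
        /* storage returns to its preallocated pool; no free() involved */
    }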
That assumes your list elements are in neighboring cache lines. (How a list compares to a vector in real-life performance, as opposed to time complexity, also really depends on how often your vector has to grow for your workload... often a reasonable size can be chosen up front to minimize this, for example.) But your point is of course valid.
Though I have seen vector-like structure implementations that combine arrays and linked lists, so that when you run out of space, a new array is allocated and linked to the end of the old one. I've implemented memory pooling systems like this myself. It works well, because as long as you know which bucket/array an element is in, you have random access, you get good linear scanning performance, and extending the size doesn't require any memmoves.
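A rough sketch of that hybrid shape (sizes and names made up; assumes default-constructible elements):

    #include <cstddef>

    // Contiguous chunks chained together: growth links a new chunk
    // instead of memmoving the old elements.
    template <typename T, std::size_t N = 256>
    struct Chunk {
        T items[N];
        std::size_t count = 0;
        Chunk *next = nullptr;
    };

    template <typename T, std::size_t N = 256>
    class ChunkedList {
        Chunk<T, N> first_;           // first chunk lives inline
        Chunk<T, N> *tail_ = &first_;
    public:
        ~ChunkedList() {              // free the chained chunks
            for (Chunk<T, N> *c = first_.next; c != nullptr; ) {
                Chunk<T, N> *nxt = c->next;
                delete c;
                c = nxt;
            }
        }
        void push_back(const T &v) {
            if (tail_->count == N) {  // out of space: link, don't move
                tail_->next = new Chunk<T, N>();
                tail_ = tail_->next;
            }
            tail_->items[tail_->count++] = v;
        }
    };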
I'm not saying that linked lists have no uses (my original comment about "never" was not meant entirely seriously - "never" is much too strong a word) - they certainly do - but I am saying that I feel they have much more limited use cases than most people seem to think.
I agree that intrusive linked lists have many advantages that non-intrusive linked lists do not. The main advantage of intrusive data structures is, as you said, that membership in multiple structures lives in one place, allowing constant-time deletion from all of those structures.
Replying to your other comment:
> Deleting from the middle of the list is an extremely common requirement, when you have objects that need to be enumerable in some order[s] and sometimes deleted.
I don't know what work you do in C or C++ (in other languages, I happily use their list data types without caring whether they use linked lists under the hood, because chances are, if I cared about absolute performance, I'd have been using C or C++ to begin with), but in my own work, requiring both order and frequent insertion/deletion in the middle has been rare - or the elements and lists were so small that a linear scan and copy was fast. But I guess it depends on your work; your logic here certainly sounds reasonable to me.
> When you need indexing, use a vector. When you need quick add/remove, use lists. When you need both, use both.
Sure, you could use, say, a vector of elements that are themselves nodes in an intrusive linked list. Using a random access data structure to index another data structure is something I've done myself. I suppose I'm mostly arguing against the naive use of linked lists that seems to be rampant, because I think that most uses do not do all the interesting things you mention, in which case, IMHO, my points still stand and linked lists have a narrow use case.
> You don't need to jump around. Say you have a request object you found via a hash table. After handling it you decide to destroy it. You now want to delete it from various lists it is in. You can do it on all lists in O(1) for each list. This is relatively cache-local (only need 2 extra random accesses, potential misses, for each list).
See, this is the kind of thing that I was looking for when I asked you to elaborate. You're not really using the list as a convenient growable-array replacement (as you say yourself, you don't think of them as containers); you're using it as a secondary structure to maintain order in cases where middle removal or insertion is important (or maybe the other data structures are the secondary ones... whatever). I think I better understand what you meant all along now, and your reasoning makes sense to me. I can definitely see the use of this, and I agree that the problems of linked lists basically boil down to this: they're seen as containers.
I think in that case, everything I said is totally true. I was definitely somewhat wrong about intrusive lists in the cases where some combination of these are true: another structure is used when lookup is needed, middle insertion/deletion is common, copying is expensive, order is important, elements are members of multiple lists where membership is commonly modified together.
> I regularly have my data objects within a hash table, a list, and a search tree simultaneously and efficiently, with no dynamic allocation to sustain any of that.
This sounds good - I don't think most people really do this, though. The problem isn't just std::list (or std:: containers in general). Your original reply to me was "If you really think that, you've missed CS 101", but perhaps you missed CS 101, because in CS 101 they (in mine, and in every set of course notes and data-structures book I've read) teach the common non-intrusive linked-list container, for which my points are completely valid. They're not teaching your use cases, or if they are, it's certainly not nearly as common.
EDIT: I wanted to add that all of this really depends on what you are storing to begin with. If you are only storing an integer or two, the overhead of the links will kill you and your cache efficiency. If on the other hand you are storing a lot in each node, then a few extra pointers won't matter much. Also, if you're only storing a few bytes (or a few dozen), copying isn't all that expensive, and then whether lists or vectors are faster depends on access patterns, as I said in my previous comment. A lot of my Python lists, for example, are lists of numbers. In high-performance C++ code, a lot of my vectors simply contained IDs, numbers, points (2D or 3D), or similarly simple data.
> I want to be able to dynamically add and remove objects from the list without any allocations at all. The only way to achieve that is an intrusive linked list design
So with that, let me reply to your most recent comment:
> No, for the use cases I'm thinking of, I really do just want an intrusive linked list. Why would I want to invent a convoluted way to adapt std::vector when an intrusive linked list does exactly what I want?
You said an intrusive linked list is the only way to achieve that design. I'm telling you it's not the only way to implement a zero-allocation linked list.
If an intrusive linked list is exactly what you want then great for you. You haven't clearly articulated exactly why you must have an intrusive linked list. I've given you an alternative. You're rejecting it out-of-hand because you don't want to "invent a convoluted way to adapt" already-existing containers.
> Why would I want to invent a convoluted way...
You only think it's convoluted. I argue that it's not convoluted at all.
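For concreteness, here is one way to read that alternative (a sketch, not necessarily the exact proposal being defended): preallocate the nodes inside a std::vector and link them by index, so that after a single reserve() the list operations themselves never allocate.

    #include <cstddef>
    #include <vector>

    struct Node {
        int value;
        int prev, next;            // -1 means "none"
    };

    struct VecList {
        std::vector<Node> nodes;   // owns all node storage
        int head = -1;
        int free_head = -1;        // chain of recycled slots

        explicit VecList(std::size_t capacity) { nodes.reserve(capacity); }

        int push_front(int v) {
            int i;
            if (free_head != -1) {          // reuse a dead slot
                i = free_head;
                free_head = nodes[i].next;
            } else {                        // fresh slot; no allocation as long
                i = (int)nodes.size();      // as we stay within the reserve()
                nodes.push_back({});
            }
            nodes[i] = {v, -1, head};
            if (head != -1) nodes[head].prev = i;
            return head = i;
        }

        void remove(int i) {                // O(1) unlink; slot is recycled
            Node &n = nodes[i];
            if (n.prev != -1) nodes[n.prev].next = n.next; else head = n.next;
            if (n.next != -1) nodes[n.next].prev = n.prev;
            n.next = free_head;
            free_head = i;
        }
    };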
There are many reasons you would not always be able to use an intrusive linked list. Intrusive linked lists are... intrusive. They require that you:
- directly modify T (again, might not be possible if you don't own the code for T, e.g. it comes from a third-party library)
- or have T as a member of a node type (which also might not be possible, particularly if T must be created by a factory, depending on how convoluted that factory is)
- or else be able to inherit from T (which might not be possible; see the `final` keyword).
You haven't stated any of these (or any other) reasons. You simply stated that an intrusive linked list is the "only" way to achieve zero-allocation linked lists. Without stating why you must have an intrusive list, I have merely offered another option. If you think using the standard library is convoluted here then I wonder what other data structure concepts you will have trouble understanding and suggest that C++ isn't the right language for you.
Intrusive lists are very different beasts than lists by value.
There are fewer allocations involved (good for both performance and points of failure), values can be inserted into different lists without new allocations, deletion is O(1), elements can be heterogeneous (different sizes and types), etc.
I seldom use linked lists, but when I do, most of the time I prefer them to be intrusive. There are of course intrusive list implementations in C++.
> That's true, but in modern game engines, objects of the same type share an allocation pool.
I don't think pool allocation guarantees memory contiguity: what if we have a pool with two elements free, those elements being the first and last ones, and we allocate two elements and stick pointers to them in our vector?
> a vector implementation that grows by a factor of 2....shouldn't be any significant cost
It depends on what you mean by "cost". I agree that the enlargement cost is acceptable from an overall CPU budget sense, but each enlargement can still cause a latency spike, since the cost for enlargement is still large (especially considering the cost of taking the heap lock) for _that_ insertion. Sometimes consistency is more important than throughput.
> vector is still a better choice because it prevents the CPU from accessing a bunch of intermediate list nodes which will pollute the CPU cache
Who said anything about intermediate list nodes? The nice thing about an intrusive linked list is that once you pay for accessing an element's cache line, the rest of the accesses are nice and local. There are no intermediate nodes, and all costs are smooth and consistent. Besides: you can issue a memory prefetch instruction for node B while you process node A, further hiding latency.
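For example (a sketch; `__builtin_prefetch` is the GCC/Clang intrinsic):

    struct Node {
        Node *next;
        long payload;
    };

    long walk(Node *head) {
        long sum = 0;
        for (Node *n = head; n != nullptr; n = n->next) {
            if (n->next)
                __builtin_prefetch(n->next);  // start the next miss early
            sum += n->payload;                // "process" the current node
        }
        return sum;
    }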
While a vector of value types might be best in most situations, if you can't use that, an intrusive linked list may or may not be better than a vector of pointers.
> You can very much have a single vector owning the memory, and do all other ordering over auxiliary vectors of indices. Should be cheaper and faster than holding more linked lists.
Cheaper in space depends on the percentage of elements in the auxiliary list. An intrusive list reserves space for every element to be in the list, which is wasteful if few are. A vector that grows by doubling could waste nearly half of its slots. Indexes can often be smaller than pointers, though, which favors the vector approach.
Faster is debatable. Iterating the vector of indexes is quite fast, but indirecting into the data in a random order is still slow. An intrusive linked list doesn't need to do the former, only the latter. (Then again, it also bloats up the data slightly, which matters for small items since fewer fit in cache.)
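The shape of that trade, sketched with illustrative names:

    #include <cstdint>
    #include <vector>

    struct Item {
        int key;
        /* ... payload ... */
    };

    // Scanning by_key is a fast, dense walk over small indexes, but each
    // items[i] access may still miss cache if the order is effectively random.
    int sum_in_aux_order(const std::vector<Item> &items,
                         const std::vector<std::uint32_t> &by_key) {
        int total = 0;
        for (std::uint32_t i : by_key)
            total += items[i].key;
        return total;
    }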
The main reason why linked lists could still be at an advantage is if you don't want allocations in your auxiliary list manipulation path. Maybe you don't want the unpredictable timing, or maybe you can't handle failure.
I agree on tombstoning, but note that you're giving up some of the vector advantages by doing so. Your processing loop now has a much less predictable branch in it. (Though really, the vector probably still wins here, since the linked list is built out of data dependent accesses!)
Sometimes these auxiliary lists need to contain things that are no longer in the main list, too, as in the case of a free list. (And if you swap such things to the end, then you've just invalidated all of your indexes.) And non-intrusive linked lists can point to variable-size elements (though they lose most of the other benefits of an intrusive linked list.)
Anyway, it's the usual "use the right tool for the right job", I just claim that linked lists are sometimes the right tool.
(Random side note: I use "indexes" instead of "indices" these days, for a silly reason—"indices" is a fine word, I have no issue with it, but it seems to encourage some people to use the non-word "indice" which is an abomination.)
Yeah, that's true for values you're willing to allocate inline. The key advantage of intrusive linked lists is being able to add and remove normally allocated objects without having to construct separate list nodes. The STL does have splice(), but it's awkward -- you always have to keep the object in a list and refer to it using iterators instead of normal references.
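A sketch of what the splice() route looks like (illustrative names):

    #include <list>

    struct Obj { int id; };

    std::list<Obj> active, idle;

    // Moves one node from `active` onto the end of `idle` in O(1) with no
    // allocation; the iterator stays valid and now refers into `idle`.
    void park(std::list<Obj>::iterator it) {
        idle.splice(idle.end(), active, it);
    }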
No, for the use cases I'm thinking of, I really do just want an intrusive linked list. Why would I want to invent a convoluted way to adapt std::vector when an intrusive linked list does exactly what I want?
(Example use case: Some number of objects want to register themselves as observers on some event, and unregister themselves later, in arbitrary order. The same object may register and unregister itself many times. std::unordered_set<Observer*> would be the best fit if performance doesn't matter but an intrusive linked list requires no allocation at all on behalf of the list.)
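A hand-rolled sketch of that use case (illustrative names); the links live in the Observer itself, so registering and unregistering never allocate:

    struct Observer {
        Observer *prev = nullptr;
        Observer *next = nullptr;
        virtual void on_event() = 0;
        virtual ~Observer() = default;
    };

    struct Event {
        Observer *head = nullptr;

        void subscribe(Observer *o) {    // O(1), no allocation
            o->prev = nullptr;
            o->next = head;
            if (head) head->prev = o;
            head = o;
        }
        void unsubscribe(Observer *o) {  // O(1); o must currently be subscribed
            if (o->prev) o->prev->next = o->next; else head = o->next;
            if (o->next) o->next->prev = o->prev;
            o->prev = o->next = nullptr;
        }
        // Simplified: assumes no observer unsubscribes itself mid-iteration.
        void fire() {
            for (Observer *o = head; o; o = o->next)
                o->on_event();
        }
    };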
You forgot insertion and removal, which intrusive lists provide in constant time and vectors do not.
If you can own the objects, a linked list (of any kind) is usually not appropriate. The essence of the efficiency of an intrusive linked list is that the same indirection that must point at an object that is not owned is reused to give the list structure. Without this trick, linked lists are not much good.
1. He's not comparing apples-to-apples between `std::list` and his intrusive linked list; the proper analogue would be to erase a node via its iterator, for which the running time is still O(1).
2. The main (only?) reason to use intrusive lists is for the single indirection (which helps both memory and speed). In his example for how using std::list would crash his server because of the O(N) node removal, he's just not storing the right data for each connection (again, use an iterator; see the sketch after this list).
3. He looked at boost's intrusive list, but I'm guessing he didn't actually try it out. The examples are pretty good, and it's much easier than it "looks". (That is, boost libraries can look intimidating when you first look at them because they're so template-heavy.)
4. It may even be that a vector, plus an auxiliary data structure for lookup, may be faster.
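To illustrate points 1 and 2 (a sketch with made-up names): remember the iterator at insertion time, and removal is O(1), no search required.

    #include <list>
    #include <utility>

    struct Connection { int fd = -1; /* ... */ };

    std::list<Connection> connections;

    struct Handle {
        std::list<Connection>::iterator it;  // remembered at insert time
    };

    Handle add(Connection c) {
        connections.push_front(std::move(c));
        return Handle{connections.begin()};
    }

    void remove(Handle h) {
        connections.erase(h.it);             // O(1): no traversal
    }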
More often than you'd think. The simplification of memory management when you have intrusive linked lists is a huge win, both for perf and correctness.
What's "large"? Even at thousands or tens of thousands, vector-style lists can be a win simply because they're faster to iterate through, they optimize locality, and they don't require pointer chasing.
At very large sizes, deletions from the interior can be painful, but at very large sizes you need to be customizing for performance (if that matters) anyways. Instead of deleting, for instance, you can invalidate, and then amortize the memcpy across many deletions; a straightforward time/space tradeoff.
The big problem I have with "performant" linked lists is that malloc is death to performance. The single biggest profiler payoffs I've ever gotten --- order of magnitude improvements --- have been from getting rid of calls to malloc. Yes, you can custom-alloc a linked list, but you're then getting to a place where lists and vectors are starting to converge on each other.
I think C programmers use linked lists instead of "vectors" because realloc resizing is scary, and because every CS student learns linked lists in their first month of class.
>But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.
It's not strictly bad, but it's useful to minimize the number of pointer dereferences wherever you can. A non-intrusive linked list will have 2 pointer dereferences to access any bit of data, and n+1 pointer dereferences to access element n. If you have fixed-size small objects, then a vector of values is almost always better than a vector of pointers to the small objects. An intrusive linked list will save you a pointer dereference, but you still have the n dereferences to access element n.
>Should we ditch Lisp
Lisp's lists model a linked list with chains of cons cells, but there's no hard requirement for them to actually be implemented as linked lists. A typical approach is to implement lists in terms of Bagwell's VLists, which are a kind of middle ground between linked lists and vectors. You get a reduced number of pointer dereferences plus increased cache locality, whilst still being able to index, insert, and delete in O(log n), and without requiring large up-front allocations.
>and Java
If you subscribe to the "everything is an object" model religiously, then yes, you're probably doing harm. As always, there's no hard rules here and it always depends on your problem and data access requirements. You can usually get performance gains by using memory pools, arrays of structures/primitives, and entity systems instead of inheritance hierarchies.
Vector is a fine data structure, but doesn't do the same job.
The purpose of intrusive linked lists is to provide constant time, allocation-free insert/remove from multiple sequences that all refer to the same object (identity, not value), given only a pointer to the object.
Multiple vectors obviously don't provide this, since vectors contain objects by value. Multiple vectors of pointers can work, but then removal requires either an iterator or a linear scan for each vector. If your object is in many lists at once, that gets pretty messy.
In particular the way that game engines use this, with objects packed in preallocated arrays, gives constant time insertion and deletion from multiple sequences with absolutely zero allocation or copying at any time. Vectors simply do not provide this functionality.
> 1: you care about ordering - often you don't care about ordering and in that case, there is no need for a linked list because you can achieve O(1) insertion & removal then too: insertion can always be at the end, removal can be a "swap with end element, remove end element" operation.
Whether you care about ordering or not, linked lists work great. Adding to an array is only amortized O(1). It is worst-case O(N). Why pay O(N) for the worst case when you can pay a cheap O(1) always?
> 2: inserting and/or removing from the middle of the list is a common operation, if it is not a common operation, then the added cost of doing so with a vector may still be outweighed by a vectors other advantages
Deleting from the middle of the list is an extremely common requirement, when you have objects that need to be enumerable in some order[s] and sometimes deleted.
> 3: you do not require random access - lists do not provide random access and lookup is O(n). At the expense of additional complexity in implementation and more memory overhead, you could reduce lookup to, OTOH, O(log n) by using a skip-list
Here is a false dichotomy dictated by the STL. With intrusive lists, you can have both your vector and lists of various orders. There is no contradiction.
When you need indexing, use a vector. When you need quick add/remove, use lists. When you need both, use both.
> 4: you do not iterate through the list often - if you do, you are likely going to blow the cache and mess up prefetching due to poor cache locality. Iterating through an array-based data structure can be much faster in this case.
Yes, if you want quick (repeated) enumeration, put it in a vector. Again, this doesn't contradict also putting it in lists.
> The biggest issue they seemed to mention, though, was cache locality - I fail to see how intrusive linked lists solve this.
Cache locality is based on your use pattern. When enumeration isn't your common operation, vectors don't have better locality than lists. For example, a sorted list of requests where you may want to time out the oldest one will only refer to the head and tail of the whole list. Cache locality is great.
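A sketch of that access pattern (illustrative names); expiring the oldest request touches only the head of the list, however long the list is:

    struct Request {
        Request *prev, *next;
        long deadline;
    };

    struct TimeoutList {
        Request *head = nullptr;   // oldest
        Request *tail = nullptr;   // newest

        // O(1); returns an expired request, or nullptr if none has expired.
        Request *expire_one(long now) {
            if (!head || head->deadline > now)
                return nullptr;
            Request *r = head;
            head = r->next;
            if (head) head->prev = nullptr; else tail = nullptr;
            r->prev = r->next = nullptr;
            return r;
        }
    };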
> then you end up jumping around the preallocated nodes anyway and after a while will lose any cache-friendliness you may have had.
You don't need to jump around. Say you have a request object you found via a hash table. After handling it you decide to destroy it. You now want to delete it from various lists it is in. You can do it on all lists in O(1) for each list. This is relatively cache-local (only need 2 extra random accesses, potential misses, for each list).
> C programmers are trained to use linked lists. They are the first variable-length containers most programmers come into contact with and so C programmers tend to be imprinted on them like ducklings. Linked lists are a poor general purpose container.
I think the false premise here, espoused by the STL, is that a data structure is a "container" at all. Your data can be "contained" by anything (the stack, the heap, a vector, ...). The data structures involved (linked lists, hash tables, search trees) all do not contain the data. They organize the data.
When using vectors as containers STL-style, you cannot really have your data organized by multiple data structures efficiently.
I regularly have my data objects within a hash table, a list, and a search tree simultaneously and efficiently, with no dynamic allocation to sustain any of that.
STL cannot do this because of the data-structure as "container" philosophy.
Yes, if you have a simple choice between a vector and a linked list, then the vector is vastly superior due to locality and (for non-intrusive linked lists) allocations. So much so that vectors often win even when you're doing lots of O(n) deletions that would be O(1) with a linked list.
But that doesn't mean that linked lists are useless! A vector gives you a single ordering. What if you need multiple? What if you need a free list, which you're never going to be iterating over but will just grab off an item at a time? I find it quite common to have one "natural" order, for which I will use a vector (or equivalently, a bump allocator of fixed-size items), and then one or several auxiliary orders like the entries belonging to a particular owner or the ones that will need to be visited for some sort of cleanup or an undo list or some sort of stack or queue of pending items to process. Your common iteration will be in memory order, but that doesn't mean you won't ever want to do different iterations.
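A free list like that can be threaded through the owning vector by index (a sketch with illustrative names); it is never iterated, only popped and pushed one slot at a time, so its own locality never matters:

    #include <vector>

    struct Slot {
        int next_free;   // -1 while the slot is live, or at the list's end
        /* ... payload ... */
    };

    struct SlotPool {
        std::vector<Slot> slots;
        int free_head = -1;

        int alloc() {
            if (free_head == -1) {          // nothing to recycle: extend
                slots.push_back({-1});
                return (int)slots.size() - 1;
            }
            int i = free_head;              // pop a recycled slot
            free_head = slots[i].next_free;
            return i;
        }
        void release(int i) {               // push the slot for reuse
            slots[i].next_free = free_head;
            free_head = i;
        }
    };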
It annoys me that this is always omitted, with the attitude that linked lists are obsolete and useless because vectors be moar better faster gooder, to the point that it's a waste of time to learn how to manipulate linked lists anymore.
I guess a lot of this is probably due to the popularity of high level languages, where you're just dealing with references to everything in the first place. But in those, the arguments in favor of vectors are often not valid because that blessed golden vector of Objects is giving you no more locality than giving your Object a `next` field: the vector of Objects is represented internally as a vector of pointers to the actual Object data, so your oh so nicely cached lookups are doing memory reads that are 100% pure overhead compared to following a `next` link from an Object that fits into a cache line. In both cases, your performance is going to be limited by the locality of your Object data, which is the same whether you have a vector of pointers or Object data with an intrusive pointer.
Also, if you have an existing setup and need to introduce a new list, it is sometimes far easier to drop in a `next` (and maybe `prev`) field than to refactor everything to accommodate a new vector. Especially since the vector will move all of your data when resizing the vector, which invalidates any pointers you might be using for other purposes. If you'll be iterating that list frequently, then the vector may very well be a good idea. If it's just for error cases or slow paths, then linked lists really aren't bad at all.
I'm not trying to argue for linked lists here so much as arguing against the blanket arguments against them.
Note that this (and the article) describes an intrusive linked list. Extrusive linked lists (like you might see in a class library or CS 101 project), where the node structure is heap-allocated separately from the data that it points to, have very few practical advantages over vectors, which is why standard libraries are increasingly defaulting to vectors even when the class is named "list".
1. Linked lists require extra per-node memory for pointers. Plain lists (vectors in C++) don't.
2. Linked lists have poor cache locality. Each item is in a different place.
3. Using Person* instead of Person for the list objects results in twice the number of objects. In C++ you have this choice; in many higher-level languages like Java, you don't: generic linked lists only support the double-object approach. (You could of course manually add pointer fields to your objects to obtain a single-object solution, but this defeats the write-once-run-with-any-element-type purpose of a reusable container library.)
4. You often only care about adding and removing from one or both ends. A vector or deque has O(1) removal in this case as well (assuming it's allocated with adequate space, or you don't care about transient CPU spikes due to reallocations early in your program's run, before it reaches its max size).
5. Large numbers of objects are more stressful on memory allocators and garbage collectors.