Yeah, that's fair. If it's the standard library hashmap, then yes, that's an Abseil Swiss Table (well, not literally, but that's the design it reimplements), so it will have great performance if correctly sized. And if it's relatively sparse, I'd bet it's faster than the heap-allocated Vec idea, because so much of it will fit in your L1 cache and the Vec would not (unless you've got a complete beast of a CPU dedicated to this task).
I would add the size hint if your program has any reason to know what hint to give (blind guesses are worse than just calling new) and otherwise forget optimizing it until it's too slow.
Yes, in the end I had to go to a custom hashmap/linked list combination to get the best performance for our specific use case (basically a flow table with LRU cache behavior and also expiration of flows by time limits).
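For anyone curious, a minimal sketch of that shape in C++ - std::unordered_map over std::list iterators. The FlowKey/Entry fields, the hash, and the class names are all made up for illustration; a production flow table would be far more careful about hashing, memory layout, and what "expiration" means:

```cpp
#include <chrono>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical flow record; the field names are invented for illustration.
struct FlowKey {
    uint32_t src, dst;
    uint16_t sport, dport;
    bool operator==(const FlowKey&) const = default;
};
struct FlowKeyHash {
    size_t operator()(const FlowKey& k) const {
        // Cheap mix; a real flow table would use a stronger hash.
        return (size_t(k.src) * 0x9e3779b1u) ^ k.dst ^ (size_t(k.sport) << 16) ^ k.dport;
    }
};

using Clock = std::chrono::steady_clock;

class FlowTable {
    struct Entry { FlowKey key; Clock::time_point last_seen; /* flow stats... */ };
    std::list<Entry> lru_;   // front = most recently used, back = least recently used
    std::unordered_map<FlowKey, std::list<Entry>::iterator, FlowKeyHash> index_;
    size_t capacity_;
    Clock::duration max_idle_;

public:
    FlowTable(size_t capacity, Clock::duration max_idle)
        : capacity_(capacity), max_idle_(max_idle) {}

    Entry& touch(const FlowKey& key) {
        auto now = Clock::now();
        auto it = index_.find(key);
        if (it != index_.end()) {
            // Move the node to the front without reallocating it (splice is O(1)).
            lru_.splice(lru_.begin(), lru_, it->second);
        } else {
            if (lru_.size() >= capacity_) evict_one();   // LRU eviction
            lru_.push_front({key, now});
            index_[key] = lru_.begin();
        }
        lru_.front().last_seen = now;
        expire(now);
        return lru_.front();
    }

private:
    void evict_one() {
        index_.erase(lru_.back().key);
        lru_.pop_back();
    }
    void expire(Clock::time_point now) {
        // The oldest entries live at the back, so time-based expiration only inspects the tail.
        while (!lru_.empty() && now - lru_.back().last_seen > max_idle_) evict_one();
    }
};
```

The splice call is what keeps the LRU bump O(1) without invalidating the iterators stored in the map.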
If you decide that you can't afford for it to live on the stack, you could use a Vec so that it lives on the heap, but it's still a single pointer dereference to get the QueryInfo (or not), which is a similar cost.
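In C++ terms the same idea might look something like this; QueryInfo is a placeholder for whatever payload you're storing, and the 65536 slots come from the u16 key mentioned further down:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for whatever per-key payload the map stores.
struct QueryInfo { /* ... */ };

// Direct-indexed table keyed by a u16: one heap allocation,
// one pointer dereference per lookup.
struct QueryTable {
    std::vector<std::optional<QueryInfo>> slots;

    QueryTable() : slots(65536) {}   // one slot per possible u16 key

    void insert(uint16_t key, QueryInfo info) { slots[key] = std::move(info); }
    QueryInfo* find(uint16_t key) { return slots[key] ? &*slots[key] : nullptr; }
    void erase(uint16_t key) { slots[key].reset(); }
};
```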
The nice thing about the stack is that it's warmer (more likely to be in cache), but this is less valuable for these larger sizes where the structure is too large to all fit in fast cache anyway.
Actually depending on how sparse the map is, that's a reason not to do this at all. I don't know what a FxHashMap is, but if you've got a u16 key and only a small number of entries in your map, some maps won't need more than a few cachelines to store that data, far smaller than the 1.5MB contemplated above.
Anyway, with anything like this, measure three times, mark twice, cut once. If you don't have a performance problem, you needn't do anything; if you do have a performance problem, you need to measure it before you try to fix it.
That's a nice little demonstration of superior vector performance for a certain set of (very common) conditions:
# The value type (int in this case) is very small, in fact it's pointer-sized.
# This entire benchmark easily fits into L2 cache. The benchmark only goes up to 100000 elements, which for an int is only 390K. This can be an advantage of vector, but it's not necessarily such a slam-dunk in real-world programs where there are multiple data structures and other threads contending for the use of the cache.
# The value type has 'trivial' copy/move constructors and assignment operators. It's POD and outright memmove()-able.
# The benchmark code is only able to use list nodes one time. It doesn't take advantage of a list's ability to move items from one container to another without reallocation (see the splice sketch after this list).
# The benchmark is thoroughly randomized. If it were more typical real-world data (such as data supplied by another algorithm, or even a malicious attacker) then you would want to consider the possibility of pathological access patterns arising.
So if you meet all of these conditions, definitely use a vector. If not, life is back to being complicated again.
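On point 4 above, the node-moving ability looks like this; a toy sketch, but it's exactly the operation such a benchmark never exercises:

```cpp
#include <list>

// A list can hand its nodes to another list without copying, moving,
// or reallocating the elements themselves.
void drain_ready(std::list<int>& pending, std::list<int>& ready) {
    // Moves every node of `pending` onto the end of `ready` in O(1);
    // no element is touched, only the node links are rewired.
    ready.splice(ready.end(), pending);
}
```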
Yes, but a properly-sized hash table will result in an average of 1 to 2 cache misses per lookup regardless of how many elements. This is far better than what you get with other structures, except perhaps in specialized cases that happen to follow tree traversal.
and can be about a million times slower if you have to page fault all the way into swap.
Disk based data requires specialized algorithms.
Accessing every element once in a large naive hash table will give you O(n) page faults on average, whereas accessing every element sequentially in a vector will give you O(n/d) page faults, where d is the number of elements per page. While asymptotically similar, in real life that constant d can be gigantic (e.g. 512 eight-byte elements per 4 KB page).
Sure, a vector is better when you only ever access elements in a single sequential order and don't add or remove elements anywhere except the end.
That being said, I'm not sure how the C++ unordered_map handles this.
C++ defines a method of iterating over all the elements of an unordered_map. I think a typical implementation will just treat the hash memory as a vector and return values from the occupied buckets.
> Is the hope to avoid paying the cost of an extra pointer dereference when we're using the first map? Or does independently allocating each object hurt cache locality or something like that?
Both.
In practice, I probably wouldn't use a hashmap for the first container that actually owns these items. When I do expect gigabytes of data, in C++ I use something like vector<vector<tValue>>, where the inner vectors are all of the same fixed size (except for the last one), e.g. 2-16 MB of RAM each. If I need to erase elements, I include a free list such as this one: https://github.com/Const-me/CollectionMicrobench/blob/master...
But the exact container is not that important here. If you don’t have that many values, it can as well be a standard vector.
The point is, C++ allows composing these containers into higher-level ones, such as this indexed-array example, using pointers to link individual items across them. This makes it possible to build sophisticated and efficient data structures that are still possible to reason about.
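A rough sketch of that chunked layout, with the free list simplified to an index stack (the linked example uses an intrusive one threaded through the erased elements) and an arbitrary 4 MB chunk size:

```cpp
#include <cstddef>
#include <vector>

// tValue and the chunk size are placeholders.
template <class tValue>
class ChunkedStore {
    static constexpr size_t kChunkBytes = 4 * 1024 * 1024;
    static constexpr size_t kChunkSize  = kChunkBytes / sizeof(tValue);

    std::vector<std::vector<tValue>> chunks_;
    std::vector<size_t> freeList_;   // global indices of erased slots, reused first

public:
    size_t insert(tValue v) {
        if (!freeList_.empty()) {
            size_t idx = freeList_.back();
            freeList_.pop_back();
            (*this)[idx] = std::move(v);
            return idx;
        }
        if (chunks_.empty() || chunks_.back().size() == kChunkSize) {
            chunks_.emplace_back();
            chunks_.back().reserve(kChunkSize);   // fixed-size chunks, so no reallocation
        }
        chunks_.back().push_back(std::move(v));
        return (chunks_.size() - 1) * kChunkSize + (chunks_.back().size() - 1);
    }

    void erase(size_t idx) { freeList_.push_back(idx); }   // slot recycled on next insert

    tValue& operator[](size_t idx) {
        return chunks_[idx / kChunkSize][idx % kChunkSize];
    }
};
```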
It's a very interesting experiment/conclusion, but it rests upon one assumption: The assumption that the entire dataset has been preloaded into the L1/L2/L3 caches.
This assumption is a shaky one to make, and is easily violated. Imagine if you have a hashmap that is small enough to fit entirely in L3 cache. However, most of it has been evicted from the L1/L2 caches, by other data that the core has been reading/writing to as well. Eventually, the thread returns to the hashmap and performs a single lookup on it. In this scenario, the time required will indeed be O(1).
So what you really have is a best-case complexity of O(sqrt(N)), if your data has been preloaded into the closest possible caches, and a worst-case complexity of O(1), if your data is stuck in an outer-level cache/DRAM. Given that we usually care more about the worst-case scenarios, not the best-case scenario, using the O(1) time complexity seems like a reasonable choice.
Going back to the author's premise that the time-complexity of a single memory access is O(sqrt(N)), not O(1), this is true only where N represents all/most of the dataset being processed. If N represents only a small fraction of the dataset being processed, and your caches are going to be mostly filled with other unrelated data, then the time complexity is closer to O(1).
Clearly the O(sqrt(N)) is more accurate than O(1) under some circumstances, but even so, it's not clear what benefit this accuracy confers. All models are inaccurate simplifications of reality, but simple-inaccurate models can still be useful if they help in decision-making. Big-O analysis isn't used to estimate the practical running-time of an application. For that, you'd be better off just running the thing. Big-O analysis is more used to compare and decide between different competing algorithms/data-structures. And in that sense, whether you choose to model linked-lists/binary-search/hash-maps as O(N·sqrt(N)) / O(log(N)·sqrt(N)) / O(sqrt(N)), or as O(N) / O(log N) / O(1), the recommendation you end up with is the same.
Searching the vector is literally incrementing a pointer over the data. The number of instructions needed to do the search is very small - e.g. ~15. This means that it can very easily fit into the CPU uop cache, and it also makes it a candidate for the LSD (loop stream detector). Both of those are a major factor in hiding the latencies, or getting rid of them, in the CPU frontend fetch-decode pipeline, effectively leaving the for-loop iterations bound only by the CPU backend - more specifically, by branch mispredictions (i.e. ROB flushes) and memory latencies.
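For reference, the entire "search" being discussed is roughly this - a handful of instructions per iteration:

```cpp
#include <cstddef>

// One load, one compare, one increment per element; small enough to live
// entirely in the uop/LSD caches.
const int* find_linear(const int* data, size_t n, int needle) {
    for (size_t i = 0; i < n; ++i)
        if (data[i] == needle) return &data[i];
    return nullptr;
}
```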
Given the predictable access nature of vectors and their contiguous layout in the memory, the CPU backend will be able to take advantage of those facts and will be able to hide the memory latency (even within the L1+L2+L3 cache) by pre-fetching the data on consecutive cache-lines just as you go through the loop. Accessing the data that resides in L1 cache is ~4 cycles.
The "non-branchiness" of such code will make it predictable and as such will make it a good use of BTB buffers. Predictability will prevent the CPU from having to flush the ROB and hence flushing the whole pipeline and starting all over again. The cost of this is one of the largest there are within the CPU and it is ~15 cycles.
OTOH, searching the open-addressing hashmap is a superset of that - almost as if you're searching over an array of vectors. The search code alone is: (1) larger by several factors, (2) much more branchy, (3) less predictable and (4) less cache-friendly.
Algorithmically speaking, yes, what you're saying makes sense, but I think the whole picture only emerges once the hardware details are also taken into account. The vector approach will literally be bound only by the number of cycles it takes to fetch the data from L1 cache, and I don't see that happening for a hash-map.
I wonder though if you might actually do better overall with a smaller lookup table and interpolation (or even just a polynomial approximation, which can be evaluated without branching), since large lookup tables can lead to bad cache behavior.
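Something along these lines, using sin() as a stand-in for whatever function the big table was approximating; the table size and the range handling are arbitrary here:

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr size_t kTableSize = 256;            // small enough to stay cache-resident
constexpr double kTwoPi = 6.283185307179586;

std::array<double, kTableSize + 1> makeTable() {
    std::array<double, kTableSize + 1> t{};
    for (size_t i = 0; i <= kTableSize; ++i)
        t[i] = std::sin(kTwoPi * double(i) / kTableSize);
    return t;
}

// Small table + linear interpolation; assumes 0 <= x < 2*pi.
double sinLerp(double x) {
    static const auto table = makeTable();
    double pos = (x / kTwoPi) * kTableSize;
    size_t i = size_t(pos);
    double frac = pos - double(i);
    // Interpolate between the two surrounding table entries.
    return table[i] + frac * (table[i + 1] - table[i]);
}
```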
I was thinking about that when I read the article too - how you've obviously done a lot of testing, but the run-time's allocator algorithm is going to have an effect on this, and allocating N floats in a loop will perform differently than if the floats were allocated over time (with other allocations mixed in), or if the objects were larger. Especially if the comparison function requires accessing object members from other memory locations... The magic value 3 might have a lot to do with the number of objects that fit in the cache or a page.
Anyway, it's an interesting article and an interesting library.
It's an interesting use-case where the data is all known up-front.
I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.
Good luck. Another benefit of this strategy is that you can optimize that data structure using techniques that aren't available in higher-level languages. For instance, small trees can be laid out so that all of the nodes of the tree are very close together, improving the odds of a cache hit. You can switch from lots of small strings to integers that index a lookup table of strings used for display only.
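The string trick might look roughly like this (StringPool and the 32-bit ids are just illustrative names, and a real interner would avoid storing each string twice):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Keep a 32-bit id in the hot data structure and only touch the actual
// strings when rendering output.
class StringPool {
    std::vector<std::string> strings_;
    std::unordered_map<std::string, uint32_t> ids_;

public:
    uint32_t intern(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        uint32_t id = uint32_t(strings_.size());
        strings_.push_back(s);
        ids_.emplace(s, id);
        return id;
    }
    const std::string& lookup(uint32_t id) const { return strings_[id]; }
};
```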
The amount of work to do this is insane. Expect 10x what it took to write it in a high level language. But the performance can often be made 10-100x as well. Which is a giant payoff.
This does not sound accurate. Memory access dominates in most current programs, and it dominates by a factor of several hundred (meaning one memory access takes more time than 100 L1 cache hits). If you use more memory, by necessity your cache will be less efficient. In the vast majority of cases, it's not worth it.
If you trade cpu usage for heap size, you'll lose in all modern cpus. In fact it's getting to the point where worse algorithms that are space-efficient are starting to beat the best known algorithms in practice, even if their constants are different. Dual TLB lookup when running virtualized doesn't help with that either.
If you can avoid a memory read with 400-500 assembly instructions (say a C/C++ function of half a screen), that is actually worth it. I did a test on my newest machine that showed it's actually more efficient to run the sieve of Eratosthenes for prime numbers under ~800 than to keep a table of those primes in memory.
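The compute side of that comparison is just the classic sieve, something like the sketch below; the exact cutoff where recomputing beats a cold table will obviously vary by machine:

```cpp
#include <bitset>
#include <vector>

// Recompute the small primes on demand instead of keeping a table warm in cache.
std::vector<int> primesBelow800() {
    constexpr int kLimit = 800;
    std::bitset<kLimit> composite;   // 100 bytes of scratch - a couple of cache lines
    std::vector<int> primes;
    for (int i = 2; i < kLimit; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (int j = i * i; j < kLimit; j += i) composite[j] = true;
    }
    return primes;
}
```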
It also means that you should easily be able to execute about a page of inlined C/C++ code in the time it takes the CPU to simply jump to a virtual function. Go and Java only have virtual functions.
Likewise, having programs work directly on compressed memory is actually a net win. E.g. storing protobufs in memory and unpacking single fields out of them when used, instead of using unpacked structs directly, is actually a win even for quite large structs. Not for megabyte-long ones, but you know, in 2 years there'll be a new Intel micro-arch and who knows?
Good luck keeping objects pointerless in Java. Or Go, for that matter (you can't use any interfaces or pointers if you do).
But don't take my word for this. Here's a trivially simple test: GCC can compile Java programs down to machine code, and the memory model of those compiled binaries is different from the JVM's. Take what you consider a slow Java program, compile it with GCC, and execute it. Note the 100%+ improvement in speed (this only really works for memory-intensive stuff, of course). Rewriting it in C++ (assuming you know what you're doing) will get you another 100%+ improvement.
I wonder if it is worth it to create a hybrid data structure: A linked list of arrays. Say you'd tune it so each array fits in the L1 cache. You don't need a huge chunk of contiguous memory, but still gain most of the performance gains of arrays. Iteration gets a lot more complicated, of course.
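That hybrid exists as the "unrolled linked list"; a minimal sketch, where ChunkedList and the 1024-element chunk size are arbitrary choices:

```cpp
#include <array>
#include <cstddef>
#include <memory>

// A singly linked list whose nodes each hold a small array, sized so a node
// stays well under typical L1 sizes.
template <class T, size_t ChunkSize = 1024>
class ChunkedList {
    struct Node {
        std::array<T, ChunkSize> items;
        size_t count = 0;
        std::unique_ptr<Node> next;
    };
    std::unique_ptr<Node> head_;
    Node* tail_ = nullptr;

public:
    void push_back(T value) {
        if (!tail_ || tail_->count == ChunkSize) {
            auto node = std::make_unique<Node>();
            Node* raw = node.get();
            if (tail_) tail_->next = std::move(node); else head_ = std::move(node);
            tail_ = raw;
        }
        tail_->items[tail_->count++] = std::move(value);
    }

    // Iteration walks chunk by chunk: one pointer chase per ChunkSize elements,
    // sequential access within each chunk.
    template <class F>
    void for_each(F f) const {
        for (const Node* n = head_.get(); n; n = n->next.get())
            for (size_t i = 0; i < n->count; ++i) f(n->items[i]);
    }
};
```

You trade one extra pointer chase per chunk for not needing one huge contiguous allocation.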
... Or much smaller. I remember a benchmark for a case I had a few years ago where the cutoff for a map being faster than std::vector/array with linear probing was closer to N=10 than 100.
(Not std::map; at the time it must have been something like tsl::hopscotch_map.)
Note also that nowadays Boost, for instance, comes with state-of-the-art flat_map and unordered_flat_map containers, which give you both the cache locality of contiguous storage at small sizes and the algorithmic characteristics of the various kinds of maps.
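Usage is pretty much drop-in, if I remember the names right (unordered_flat_map needs a fairly recent Boost, 1.81+ I believe):

```cpp
#include <boost/container/flat_map.hpp>
#include <boost/unordered/unordered_flat_map.hpp>
#include <cstdint>
#include <string>

int main() {
    // Sorted contiguous storage: ordered, cache-friendly, O(log n) lookup.
    boost::container::flat_map<uint16_t, std::string> ordered;
    ordered[42] = "answer";

    // Open-addressing hash table, also contiguous.
    boost::unordered_flat_map<uint16_t, std::string> hashed;
    hashed.emplace(7, "lucky");

    return (ordered.count(42) && hashed.count(7)) ? 0 : 1;
}
```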
Especially since the array will probably be linear in memory, so will work well with the cache, and the tree probably will have jumps all over the RAM unless the implementation uses a backing array.
Cache locality matters, and making sure a large data structure fits in memory is important too. If you can't fit it, at least make it line up nicely with page boundaries. If you don't understand the implementation of your chosen array/set/map/tuple data structure that you are throwing around all over the place, you could be missing plenty of performance just from TLB misses or some other hardware voodoo.