- amortizing cache misses, as you mentioned
- better use of out-of-order, superscalar processors: by processing multiple independent packets in parallel, the processor can fill more execution units
- enable the use of vector instructions (SSE/AVX, VMX, etc.): again, processing multiple independent packets in parallel means you can leverage SIMD. SIMD instructions are used pervasively in VPP; see the sketch below.
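A minimal sketch of how the batching plus SIMD plays out (my illustration, not VPP's actual code; the function name and the choice of 64-bit lanes are made up for the example):

```c
#include <immintrin.h>  /* AVX2 intrinsics; compile with -mavx2 */
#include <stddef.h>
#include <stdint.h>

/* Apply the same fixup to a batch of packet lengths: one AVX2 add
 * covers four independent packets per iteration instead of one. */
void adjust_lengths(int64_t *len, size_t n, int64_t delta)
{
    __m256i d = _mm256_set1_epi64x(delta);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i v = _mm256_loadu_si256((__m256i *)&len[i]);
        v = _mm256_add_epi64(v, d);              /* 4 packets at once */
        _mm256_storeu_si256((__m256i *)&len[i], v);
    }
    for (; i < n; i++)                           /* scalar tail */
        len[i] += delta;
}
```

Because the packets are independent, the loads, adds, and stores of consecutive iterations can also overlap in the out-of-order window, which is the superscalar point above.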
IIRC, I tried this method but did not see any performance improvement. It replaces a single `divq` instruction with a table lookup and several multiplications, and on modern processors that is no longer a worthwhile trade-off.
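For reference, the trick being discussed looks roughly like this (my sketch of the standard formulation, not the article's exact code; production libraries such as libdivide handle more corner cases, and with many divisors the precomputed reciprocals would live in the lookup table mentioned above):

```c
#include <stdint.h>
#include <stdio.h>

/* Division by an invariant integer via reciprocal multiplication:
 * precompute ceil(2^64 / d) once, then each division becomes one
 * widening multiply plus a shift. Valid for 32-bit x and divisors
 * 2 <= d < 2^32 (d == 1 would overflow the reciprocal). Uses the
 * GCC/Clang __int128 extension for the 64x64->128 multiply. */
static uint64_t reciprocal(uint32_t d)
{
    return ~0ULL / d + 1;                     /* = ceil(2^64 / d) */
}

static uint32_t div_by(uint32_t x, uint64_t recip)
{
    return (uint32_t)(((unsigned __int128)x * recip) >> 64);
}

int main(void)
{
    uint64_t r10 = reciprocal(10);            /* one-time setup */
    printf("%u\n", div_by(12345, r10));       /* prints 1234 */
    return 0;
}
```

Whether this beats a plain `div` depends entirely on the target: recent cores have dramatically reduced integer-divide latency, which fits the lack of improvement reported above.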
> And getting back to the hardware aspect, having a single thread and a badass SIMD instruction is usually faster and easier to reason about than multiple threads and general-purpose instructions where your performance gets fetch-execute-cache-hierarchy-memory-bandwidthed to death.
ripgrep benefits from having both multiple threads and badass SIMD instructions.
This has been tried (not on the vaporware Mill; I'm thinking more of, e.g., Intel Itanium) and it failed.
The reason is extremely simple: a speculative OoO processor optimizes dynamically. If you replace that with static compile-time optimization, you are bound to be only as fast as before within the quite limited parameter ranges the compiler assumed (e.g. the number of entries in a hash table, the size of an image); outside those, a static schedule cannot adapt.
That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).
Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.
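For anyone wanting to reproduce the comparison, it was roughly between these two variants (my reconstruction; the original benchmark harness isn't shown in the thread):

```c
#include <stdio.h>

/* Variant A reads each digit from a small, L1-resident table;
 * variant B computes it with integer arithmetic on ASCII values. */
static const char digits[10] = "0123456789";

static char digit_lookup(unsigned j) { return digits[j % 10]; }
static char digit_arith(unsigned j)  { return (char)('0' + j % 10); }

int main(void)
{
    unsigned long sum = 0;
    for (unsigned j = 0; j < 100000000u; j++)
        sum += digit_arith(j);   /* swap in digit_lookup to compare */
    printf("%lu\n", sum);        /* keep the result observable */
    return 0;
}
```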
Those four integer instructions are generally quite fast already, so unless this operation is needed extremely often, there is probably no point in dedicating hardware to it.
Haven't tested it (or even read the article :) ), but maybe some kind of JIT optimization kicks in in the 1000-element case (i.e. after the "loop" has run enough times) but not in the others?
> Still, we're talking ~9 cycles per indirect call on this hardware.
In 9 cycles, you could execute up to 18 SIMD instructions, each operating on 256 bits worth of data.
For scalar code, 9 cycles can be up to 30 instructions or so; for good code, an average of about 15-20.
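(The 18 assumes a sustained throughput of two 256-bit SIMD instructions per cycle, as on e.g. Haswell and later: 9 cycles × 2 per cycle = 18 instructions, i.e. up to 18 × 32 = 576 bytes touched for the price of one indirect call.)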
> If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample
Function pointers are perfectly OK even in high-performance programming, as long as they're not what burns the cycles. If you can batch the filter, all is good. If you can't, well, you might need to think up some other solution.
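A sketch of what per-buffer batching looks like (hypothetical names, not anyone's actual API): the indirect call is paid once per block, and the callee gets a tight inner loop the compiler can optimize.

```c
#include <stddef.h>

typedef void (*filter_fn)(float *samples, size_t n);

/* Example callback: a tight loop the compiler can vectorize. */
static void gain_2x(float *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        samples[i] *= 2.0f;
}

/* One indirect call per block, so its cost is amortized
 * across all samples in that block. */
static void process(float *samples, size_t total, size_t block,
                    filter_fn filter)
{
    for (size_t off = 0; off < total; off += block) {
        size_t n = (total - off < block) ? total - off : block;
        filter(&samples[off], n);
    }
}

int main(void)
{
    static float buf[48000];
    process(buf, 48000, 256, gain_2x);   /* 188 indirect calls total */
    return 0;
}
```

With block = 1 this degenerates to the per-sample dispatch the quote above warns about.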
Profiling rocks. I simply ran some experiments with the same test setup using dynamic function pointers versus hardcoded code, and quickly noticed that the hardcoded version was over an order of magnitude faster. I figured out a way to narrow the difference and ended up doing dynamic code generation.
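The commenter's generator isn't shown, but the bare mechanism behind dynamic code generation is small. A minimal x86-64 Linux sketch (the emitted function is a made-up example; hardened W^X systems would need a write-then-mprotect dance instead of a single RWX mapping):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 SysV machine code: mov eax, edi ; add eax, eax ; ret
     * i.e. a freshly "generated" function computing 2*x. */
    unsigned char code[] = { 0x89, 0xf8, 0x01, 0xc0, 0xc3 };

    void *buf = mmap(NULL, sizeof code,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memcpy(buf, code, sizeof code);

    int (*fn)(int) = (int (*)(int))buf;  /* treat the bytes as code */
    printf("%d\n", fn(21));              /* prints 42 */
    return 0;
}
```

The win in the audio case presumably comes from baking the filter chain's constants and call sequence directly into the generated code, removing the indirect calls entirely.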
> doing nothing is cheap!
Code you don't need to run is the fastest code you can have. Infinitely fast in fact.
> In turn Out-of-Order execution largely displaced RISC as you could get better throughput for the same number of transistors by moving more work into the compiler.
What work does OoO execution displace to the compiler? I thought that OoO CPUs get better performance on the exact same programs compared to in order CPUs.