
    I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?



The only other option IIRC was Lattice C, and that was painfully slow as well. There's only so much 7.16 MHz can do.

I don't think that optimization helps when you have a spare register to use as a temp value.

If you use more registers as storage it can go faster. 'rep movs' uses no registers for storage; only source, destination and count.

There are many, many tricks, like doing 'look ahead' to have the L1 cache start bringing in the next line before we actually need it, hiding that latency.

And the article describes how there are various code paths for big blocks, short blocks, aligned, unaligned, etc.
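
For illustration, here is a minimal sketch of that look-ahead idea in C, assuming GCC/Clang's __builtin_prefetch; the 256-byte lookahead distance is just a placeholder you would tune:

    #include <stddef.h>

    /* Sketch: ask the cache to start fetching a line we will need a few
     * hundred bytes from now, so the miss latency overlaps with useful work. */
    unsigned long sum_bytes(const unsigned char *src, size_t n)
    {
        unsigned long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 256 < n)
                __builtin_prefetch(src + i + 256);  /* a hint only; never faults */
            sum += src[i];
        }
        return sum;
    }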


The time taken for any process is still limited by Amdahl's law: you can't make the inherently serial parts any faster by throwing more cores at them.
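
For reference, Amdahl's law bounds the speedup of a workload whose parallelizable fraction is p when run on N cores:

    speedup(N) = 1 / ((1 - p) + p / N)

Even with p = 0.95 and an unlimited number of cores, the limit is 1 / 0.05 = 20x; the serial 5% dominates.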

Batching packets brings several benefits:

  - amortizing cache misses, as you mentioned

  - better use of out-of-order, superscalar processors: by processing multiple independent packets in parallel, the processor can fill more execution units

  - enabling the use of vector instructions (SSE/AVX, VMX, etc.): again, processing multiple independent packets in parallel means you can leverage SIMD. SIMD instructions are used pervasively in VPP (see the sketch below)
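
As a hedged illustration of that last point (this is not VPP code; the contiguous counters[8] layout is an assumption about how a batching framework gathers hot per-packet fields), one batched step with AVX2 intrinsics might look like:

    #include <immintrin.h>
    #include <stdint.h>

    /* Bump a 32-bit counter for each of 8 packets with one vector add
     * instead of 8 scalar load/add/store sequences. Requires AVX2
     * (compile with -mavx2). */
    void bump_counters(uint32_t counters[8])
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)counters);
        v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
        _mm256_storeu_si256((__m256i *)counters, v);
    }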

IIRC, I tried this method, but did not see any performance improvement. The method replaces a single `divq` instruction with a table lookup and several multiplications. On modern processors this is no longer a worthwhile trade off.
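
For context, that kind of trick builds on multiply-by-reciprocal division; for a divisor known at compile time the compiler already does this itself, and the table lookup presumably supplies a precomputed reciprocal for runtime divisors. A sketch of the fixed-divisor case, assuming GCC/Clang's __uint128_t (0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10)):

    #include <stdint.h>

    /* Unsigned 64-bit division by 10 without a divq:
     * multiply by ceil(2^67 / 10) and keep the top bits (total shift of 67). */
    static uint64_t div10(uint64_t n)
    {
        return (uint64_t)(((__uint128_t)n * 0xCCCCCCCCCCCCCCCDULL) >> 64) >> 3;
    }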

> And getting back to the hardware aspect, having a single thread and a badass SIMD instruction is usually faster and easier to reason about than multiple threads and general-purpose instructions where your performance gets fetch-execute-cache-hierarchy-memory-bandwidthed to death.

ripgrep benefits from both having multiple threads and badass SIMD instructions.


This has been tried (not on the vaporhardware Mill, I'm thinking more of e.g. Intel Itanium) and it failed.

The reason is extremely simple: a speculative OOO processor optimizes dynamically. If you replace that with static compile-time optimizations, you are bound to match its speed only within some quite limited parameter ranges (like: the number of entries in a hash table, the size of an image, etc.)


That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).

Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.
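
A hypothetical reconstruction of the two variants (the digits table and the loop are stand-ins for whatever the original benchmark did):

    #include <stdio.h>

    /* variant A: table lookup (one L1 load per digit) */
    static const char digits[] = "0123456789";
    static char digit_lut(unsigned j) { return digits[j % 10]; }

    /* variant B: plain ALU arithmetic, no memory access at all */
    static char digit_alu(unsigned j) { return (char)('0' + j % 10); }

    int main(void)
    {
        unsigned long sum = 0;
        for (unsigned j = 0; j < 100000000u; j++)
            sum += (unsigned char)digit_lut(j) + (unsigned char)digit_alu(j);
        printf("%lu\n", sum);  /* keep the work observable */
        return 0;
    }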


Current superscalar processors can skip over multiple NOPs at once.

Good article, nice and meaty.

    mov    0xc(%r12,%r11,8),%eax
Interesting choice, won't this be a so-called "slow lea" on recent Intel hardware?

`MOV Rq, Iq` takes one uop with a throughput of up to four(!) per cycle on some microarchitectures: https://uops.info/html-instr/MOV_R64_I64.html

The article is missing benchmarks. Fewer instructions != faster execution.

Those four integer instructions are generally quite fast already. So unless this operation is needed extremely often there is probably no point in dedicating hardware to it.

Haven't tested it (or even read the article :) ), but maybe some kind of JIT optimization kicks in in the 1000-element case (i.e. after the "loop" has run enough times) but not in the others?

It's not guaranteed to defeat the optimization. For instance, it could just read `fill` into two registers and do the comparison there.

> Still, we're talking ~9 cycles per indirect call on this hardware.

In 9 cycles, you could execute up to 18 SIMD instructions, each operating on 256 bits worth of data.

For scalar code, 9 cycles can be up to 30 instructions or so; for good code, figure an average of about 15-20.

> If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample

Function pointers are perfectly OK even in high-performance programming, as long as they're not what burns the cycles. If you can batch the filter, all is good. If you can't, well, you might need to think up some other solution.
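
A minimal sketch of the two callback shapes (the type and function names are made up for the example):

    #include <stddef.h>

    /* Per-sample callback: one indirect call per sample. */
    typedef float (*sample_fn)(float);
    void process_per_sample(float *buf, size_t n, sample_fn f)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = f(buf[i]);
    }

    /* Per-buffer callback: one indirect call per block, so its cost is
     * amortized over n samples and the callee's own loop can be vectorized. */
    typedef void (*buffer_fn)(float *buf, size_t n);
    void process_per_buffer(float *buf, size_t n, buffer_fn f)
    {
        f(buf, n);
    }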

Profiling rocks. I simply ran some experimental cases with the same test setup using dynamic function pointers versus hardcoded code. I quickly noticed that the hardcoded version was over an order of magnitude faster. I figured out a way to narrow the difference and ended up doing dynamic code generation.

> doing nothing is cheap!

Code you don't need to run is the fastest code you can have. Infinitely fast in fact.


> In turn Out-of-Order execution largely displaced RISC as you could get better throughput for the same number of transistors by moving more work into the compiler.

What work does OoO execution move into the compiler? I thought that OoO CPUs get better performance on the exact same programs compared to in-order CPUs.


> it is not uncommon to reduce the number of conditionals for better CPU pipelining, and lookup tables are a very common tool for this.

On modern CPUs, data dependencies such as lookup tables often cause pipeline stalls, i.e. worse pipelining.

L1 cache is at a premium as well; you rarely want to waste it on LUTs.
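
One way to cut a conditional without spending L1 on a table is plain arithmetic; a small hedged example of a branch-free select (compilers often emit a cmov for the ternary anyway, the point is just that the data stays in registers):

    #include <stdint.h>

    /* Pick a or b depending on cond, with no branch and no lookup table:
     * mask is all-ones when cond is nonzero, all-zeros otherwise. */
    static inline int32_t select_i32(int cond, int32_t a, int32_t b)
    {
        int32_t mask = -(int32_t)(cond != 0);
        return b ^ ((a ^ b) & mask);
    }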

You can compute a lot in the ~12 cycles an L2 hit (L1 miss) costs: in theory, up to 32 * 12 = 384 floating-point operations.
