
    I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?



The only other option IIRC was Lattice C, and that was painfully slow as well. There's only so much 7.16 MHz can do.

I don't think that optimization helps when you have a spare register to use as a temp value.

If you use more registers as storage it can go faster. 'rep movs' uses no registers for storage; only source, destination and count.

There are many, many tricks, like doing 'look ahead' to have the L1 cache start bringing in the next line before we actually need it, hiding that latency.

And the article describes how there are various code paths for big blocks, short blocks, aligned, unaligned, etc.
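
For illustration, here is a minimal sketch of that look-ahead idea in C, assuming GCC/Clang's __builtin_prefetch; the 256-byte lookahead distance is just a placeholder you would tune:

    #include <stddef.h>

    /* Sketch: ask the cache to start fetching a line we will need a few
     * hundred bytes from now, so the miss latency overlaps with useful work. */
    unsigned long sum_bytes(const unsigned char *src, size_t n)
    {
        unsigned long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 256 < n)
                __builtin_prefetch(src + i + 256);  /* a hint only; never faults */
            sum += src[i];
        }
        return sum;
    }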


The time taken for any process is still limited by Amdahl's law: you can't make the inherently serial parts any faster by throwing more cores at them.
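
For reference, Amdahl's law bounds the speedup of a workload whose parallelizable fraction is p when run on N cores:

    speedup(N) = 1 / ((1 - p) + p / N)

Even with p = 0.95 and an unlimited number of cores, the limit is 1 / 0.05 = 20x; the serial 5% dominates.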

Batching packets brings several benefits:

  - amortizing cache misses, as you mentioned

  - better use of out-of-order, superscalar processors: by processing multiple independent packets in parallel, the processor can fill more execution units

  - enabling the use of vector instructions (SSE/AVX, VMX, etc.): again, processing multiple independent packets in parallel means you can leverage SIMD. SIMD instructions are used pervasively in VPP (see the sketch below)
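
As a hedged illustration of that last point (this is not VPP code; the contiguous counters[8] layout is an assumption about how a batching framework gathers hot per-packet fields), one batched step with AVX2 intrinsics might look like:

    #include <immintrin.h>
    #include <stdint.h>

    /* Bump a 32-bit counter for each of 8 packets with one vector add
     * instead of 8 scalar load/add/store sequences. Requires AVX2
     * (compile with -mavx2). */
    void bump_counters(uint32_t counters[8])
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)counters);
        v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
        _mm256_storeu_si256((__m256i *)counters, v);
    }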

IIRC, I tried this method, but did not see any performance improvement. The method replaces a single `divq` instruction with a table lookup and several multiplications. On modern processors this is no longer a worthwhile trade off.
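
For context, that kind of trick builds on multiply-by-reciprocal division; for a divisor known at compile time the compiler already does this itself, and the table lookup presumably supplies a precomputed reciprocal for runtime divisors. A sketch of the fixed-divisor case, assuming GCC/Clang's __uint128_t (0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10)):

    #include <stdint.h>

    /* Unsigned 64-bit division by 10 without a divq:
     * multiply by ceil(2^67 / 10) and keep the top bits (total shift of 67). */
    static uint64_t div10(uint64_t n)
    {
        return (uint64_t)(((__uint128_t)n * 0xCCCCCCCCCCCCCCCDULL) >> 64) >> 3;
    }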

> And getting back to the hardware aspect, having a single thread and a badass SIMD instruction is usually faster and easier to reason about than multiple threads and general-purpose instructions where your performance gets fetch-execute-cache-hierarchy-memory-bandwidthed to death.

ripgrep benefits from both having multiple threads and badass SIMD instructions.


This has been tried (not on the vaporhardware Mill, I'm thinking more of e.g. Intel Itanium) and it failed.

The reason is extremely simple: a speculative OOO processor optimizes dynamically. If you replace that with static compile-time optimizations, you are bound to match its speed only within some quite limited parameter ranges (like: the number of entries in a hash table, the size of an image, etc.)


That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the 'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10) speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still faster (for integer addition anyway).

Edit: This was with GCC 4.1.2, newer versions probably optimize differently, so who knows.
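
A hypothetical reconstruction of the two variants (the digits table and the loop are stand-ins for whatever the original benchmark did):

    #include <stdio.h>

    /* variant A: table lookup (one L1 load per digit) */
    static const char digits[] = "0123456789";
    static char digit_lut(unsigned j) { return digits[j % 10]; }

    /* variant B: plain ALU arithmetic, no memory access at all */
    static char digit_alu(unsigned j) { return (char)('0' + j % 10); }

    int main(void)
    {
        unsigned long sum = 0;
        for (unsigned j = 0; j < 100000000u; j++)
            sum += (unsigned char)digit_lut(j) + (unsigned char)digit_alu(j);
        printf("%lu\n", sum);  /* keep the work observable */
        return 0;
    }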


Current superscalar processors can skip over multiple NOPs at once.

Good article, nice and meaty.

    mov    0xc(%r12,%r11,8),%eax
Interesting choice, won't this be a so-called "slow lea" on recent Intel hardware?

`MOV Rq, Iq` takes one uop with a throughput of up to four(!) per cycle on some microarchitectures: https://uops.info/html-instr/MOV_R64_I64.html

The article is missing benchmarks. Fewer instructions != faster execution.

Those four integer instructions are generally quite fast already. So unless this operation is needed extremely often there is probably no point in dedicating hardware to it.

Haven't tested it (or even read the article :) ), but maybe some kind of JIT optimization kicks in in the 1000-element case (i.e. after the "loop" has run enough times) but not in the others?

It's not guaranteed to defeat the optimization. For instance, it could just read `fill` into two registers and do the comparison there.

> Still, we're talking ~9 cycles per indirect call on this hardware.

In 9 cycles, you could execute up to 18 SIMD instructions, each operating on 256 bits worth of data.

For scalar code, 9 cycles can be up to 30 instructions or so; for good code, figure an average of about 15-20.

> If you call each callback per buffer instead of per sample (which everyone seems to do), you're down to an amortized cost below 1 cycle per sample

Function pointers are perfectly OK even in high-performance programming, as long as they're not what burns the cycles. If you can batch the filter, all is good. If you can't, well, you might need to think up some other solution.
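
A minimal sketch of the two callback shapes (the type and function names are made up for the example):

    #include <stddef.h>

    /* Per-sample callback: one indirect call per sample. */
    typedef float (*sample_fn)(float);
    void process_per_sample(float *buf, size_t n, sample_fn f)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = f(buf[i]);
    }

    /* Per-buffer callback: one indirect call per block, so its cost is
     * amortized over n samples and the callee's own loop can be vectorized. */
    typedef void (*buffer_fn)(float *buf, size_t n);
    void process_per_buffer(float *buf, size_t n, buffer_fn f)
    {
        f(buf, n);
    }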

Profiling rocks. I simply ran some experimental cases with the same test setup using dynamic function pointers versus hardcoded code. I quickly noticed that the hardcoded version was over an order of magnitude faster. I figured out a way to narrow the difference and ended up doing dynamic code generation.

> doing nothing is cheap!

Code you don't need to run is the fastest code you can have. Infinitely fast in fact.


> In turn Out-of-Order execution largely displaced RISC as you could get better throughput for the same number of transistors by moving more work into the compiler.

What work does OoO execution move into the compiler? I thought that OoO CPUs get better performance on the exact same programs compared to in-order CPUs.


> it is not uncommon to reduce the number of conditionals for better CPU pipelining, and lookup tables are a very common tool for this.

On modern CPUs, data dependencies such as lookup tables often cause pipeline stalls, i.e. worse pipelining.

L1 cache is at a premium as well; you rarely want to waste it on LUTs.
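
One way to cut a conditional without spending L1 on a table is plain arithmetic; a small hedged example of a branch-free select (compilers often emit a cmov for the ternary anyway, the point is just that the data stays in registers):

    #include <stdint.h>

    /* Pick a or b depending on cond, with no branch and no lookup table:
     * mask is all-ones when cond is nonzero, all-zeros otherwise. */
    static inline int32_t select_i32(int cond, int32_t a, int32_t b)
    {
        int32_t mask = -(int32_t)(cond != 0);
        return b ^ ((a ^ b) & mask);
    }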

You can compute a lot in the ~12 cycles an L2 hit (L1 miss) costs: in theory, up to 32 * 12 = 384 floating-point operations.
