
Our interconnect between chips is also deterministic! You can read more about our interconnect, synchronisation, and error correction in our paper.

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...




You can find out about the chip to chip interconnect from our paper below, section 2.3. I don't think that's custom.

We achieve low latency essentially by being a software-defined architecture. Our functional units operate completely orthogonally to each other. We don't have to batch in order to achieve parallelism, and the system behaviour is completely deterministic, so we can schedule all operations precisely.
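A toy sketch of what "deterministic, statically scheduled" means in practice (this is an illustrative model, not Groq's actual compiler; the op names and latencies are made up). Because every op's latency is fixed and there are no caches or dynamic arbitration, the "compiler" can assign every op a start cycle up front, and end-to-end latency is known exactly before anything runs:

```python
# Hypothetical fixed latencies per op type, in cycles.
OP_LATENCY = {"load": 2, "matmul": 4, "add": 1, "store": 2}

def schedule(ops):
    """Assign each op a fixed start cycle from its dependencies.

    `ops` is a list of (name, kind, deps) in topological order.
    With deterministic latencies, this schedule is exact, not a guess.
    """
    kinds = {name: kind for name, kind, _ in ops}
    start = {}
    for name, kind, deps in ops:
        # An op starts as soon as its slowest dependency finishes.
        start[name] = max((start[d] + OP_LATENCY[kinds[d]] for d in deps),
                          default=0)
    return start

def total_cycles(ops, start):
    """Exact end-to-end latency, computable at 'compile' time."""
    kinds = {name: kind for name, kind, _ in ops}
    return max(start[n] + OP_LATENCY[kinds[n]] for n in start)

program = [
    ("a", "load",   []),
    ("b", "load",   []),
    ("c", "matmul", ["a", "b"]),
    ("d", "add",    ["c"]),
    ("e", "store",  ["d"]),
]
start = schedule(program)
print(start)                            # fixed start cycle for every op
print(total_cycles(program, start))     # known before execution
```

The same property is what lets them claim exact performance prediction at compile time: the predicted cycle count is the cycle count.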

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...


The Groq AI chip startup has solved this problem. They don't use hand-written kernels at all; instead they use a compiler, and they hold the top speed in the world on LLaMA2-70B: 240 tokens/s.

https://www.youtube.com/@GroqInc/videos

Other interesting Groq tidbits: their models are deterministic, the whole system of up to thousands of chips runs in sync on the same clock, and memory access and networking are directly controlled without caches or intermediaries, so they also run deterministically.

That speeds up communication and allows automatic synchronisation across thousands of chips running as one single large chip. The compiler does all the orchestration and optimisation. They can predict the exact performance of an architecture at compile time.

What makes Groq different is that they started from the compiler, and only later designed the hardware.


I recall reading, some decades ago, about dividing chips into separate domains that would communicate asynchronously, precisely to shorten the longest path and increase the max frequency.

It's my understanding modern processors do indeed do this to varying degrees.


The premise of simplifying the architecture, the focus on memory, the reliance on software, even the fact that you can stack a ton of chips per node: it all sounds very much like Groq. I wonder if this is another case of multiple discovery.

It's really extraordinary how tightly coupled modern innovation in scientific fields is to processor implementations. I suspect you and I share a keen interest in the path by which we got to this enviable situation.

The most interesting thing about the FMxxxx chips was the pipeline was async.

It's an impressive feat of scaling by the engineers at Intel et al. :)

It's always fascinating to see how complex modern hardware actually is. Dozens of small processors with their own operating systems communicating with each other. I have a lot of respect for the engineers who actually managed to build a device so small yet so complex.

And the cooperation between the CPU and the chipset. In many cases it is even a trade secret how data is striped across the different slots/banks.

IIRC, you can optimise the locations of sub-components to minimise on-chip communication delays. The automatic algorithms are pretty good, but can often be bettered (particularly if critical paths are identified and optimised).
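To make the idea concrete, here's a toy placement optimizer (purely illustrative; the netlist is made up, and real place-and-route tools use simulated annealing or analytic solvers rather than brute force). It searches for the arrangement of four connected blocks on a 2x2 grid that minimises total Manhattan wirelength:

```python
import itertools

# Hypothetical netlist: pairs of blocks that must communicate.
NETS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
SLOTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 grid positions

def wirelength(placement):
    """Total Manhattan distance over all nets, a stand-in for delay."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1])
               for u, v in NETS)

def best_placement(blocks):
    """Exhaustive search; feasible only for tiny designs."""
    best = None
    for perm in itertools.permutations(SLOTS):
        p = dict(zip(blocks, perm))
        if best is None or wirelength(p) < wirelength(best):
            best = p
    return best

p = best_placement(["a", "b", "c", "d"])
print(wirelength(p))  # the ring a-b-c-d fits the grid with unit-length nets
```

Identifying and weighting critical nets (rather than treating all nets equally, as above) is exactly where hand optimisation can still beat the automatic tools.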

> You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

This is a very cool comment in general, but I'm intrigued by this bit in particular. I'd love to see an implementation, if anyone knows of any.
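I haven't seen a published implementation either, but the core idea can be sketched (everything below is hypothetical, including the latency numbers; a real harness would measure cache-line ping-pong times between pinned threads). If core-to-core latency scales with physical distance, you can fit a measured latency matrix against candidate grid layouts and keep the one that explains it best:

```python
import itertools

# Hypothetical "measured" pairwise latencies between 4 cores (arbitrary units),
# standing in for real cache-line ping-pong measurements.
LAT = {
    frozenset({0, 1}): 10, frozenset({0, 2}): 10, frozenset({0, 3}): 20,
    frozenset({1, 2}): 20, frozenset({1, 3}): 10, frozenset({2, 3}): 10,
}
SLOTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # candidate 2x2 physical grid

def fit_error(layout):
    """How well does (Manhattan distance * 10) explain the latencies?"""
    err = 0
    for pair, lat in LAT.items():
        a, b = sorted(pair)
        d = abs(layout[a][0] - layout[b][0]) + abs(layout[a][1] - layout[b][1])
        err += abs(d * 10 - lat)
    return err

# Try every assignment of cores to grid slots; keep the best fit.
best = min((dict(zip(range(4), perm))
            for perm in itertools.permutations(SLOTS)),
           key=fit_error)
print(fit_error(best))  # 0 means some layout perfectly explains the data
```

Real measurements are noisy, hence the parent's "requires low noise"; averaging many round trips and using a least-squares fit instead of exact matching would be the obvious refinements.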


There are roughly equal levels of complexity in any modern, high-performance, general-purpose/programmable chip.

That is true. Scaling this to a processor with caches and deep pipelines is completely non-trivial. There have been some attempts to scale to more "realistic" designs with compositional (stepwise) verification.

http://www.ccs.neu.edu/home/pete/research/ieee-vlsi-composit...

http://www.ccs.neu.edu/home/pete/research/todaes-safety-live...


Remember that this work applies generically to Mali chips using the Bifrost architecture.

Isn't this where Mr Moore's (of Forth fame) CPU designs have gone? (See https://www.greenarraychips.com.)

OK, maybe not the 'actor' model per se (unless actor channels are thought of as core interconnects). I think there is also a global state, but I can't remember much of these details at the moment without re-watching the linked Strange Loop vid.

https://www.youtube.com/watch?v=0PclgBd6_Zs&t=2s or https://news.ycombinator.com/item?id=22021262 is probably the best for grokking it.


Yeah, it's an interesting design challenge. It seems like "design a SoC that performs exactly like discrete chips connected by a bus" is a task that'd have a lot of sneaky pitfalls. Partly depends on how close "exactly" has to be.

As a software person, I'm most impressed by CPU hardware -- all the crazy speculative out-of-order dynamic translation stuff that goes on at GHz timescales. It's astonishing that it works at all, let alone that it's almost 100% reliable.

The hope is that a standard will be created soon to support this behaviour. These sorts of things are complicated because they require the support of the chipset vendors, i.e. Intel and AMD.

I'm working (as a consultant) for a company that does state-of-the-art chip design. The complexity is mind-boggling. The number of different types of core components isn't very high; at root it's all transistors and (nominally) rectangular pieces of metal. But these simple building blocks interact with each other in ridiculously complicated ways. Just processing the files that contain the timing information for the current generation of chip fabrication technology takes over an hour on a fully-tricked-out server. Checking a design to see if it meets a timing spec takes many hours using dozens of servers. And then you have to deal with power, thermal constraints, geometric design rule checks (the list of rules you have to follow is a PDF document hundreds of pages long), clock distribution... Frankly, it amazes me that state-of-the-art chips work at all.
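The timing-signoff step mentioned above can be illustrated with a drastically simplified static timing analysis (the gate types, delays, and netlist here are invented; real STA also models wire delay, slew, setup/hold margins, and multiple process corners, which is why it takes dozens of servers many hours):

```python
# Made-up per-gate delays in nanoseconds.
GATE_DELAY = {"and": 2.0, "or": 2.0, "xor": 3.0, "buf": 1.0}

# Hypothetical netlist in topological order: (gate, type, fan-in signals).
NETLIST = [
    ("g1", "and", ["in_a", "in_b"]),
    ("g2", "xor", ["in_b", "in_c"]),
    ("g3", "or",  ["g1", "g2"]),
    ("g4", "buf", ["g3"]),
]

def worst_arrival(netlist):
    """Latest signal arrival time at any gate output (the critical path)."""
    arrival = {}
    for name, kind, fanin in netlist:
        # Primary inputs (not in `arrival`) are assumed valid at t = 0.
        t_in = max(arrival.get(f, 0.0) for f in fanin)
        arrival[name] = t_in + GATE_DELAY[kind]
    return max(arrival.values())

CLOCK_PERIOD = 10.0  # ns
t = worst_arrival(NETLIST)
print(t, "meets timing" if t <= CLOCK_PERIOD else "violates timing")
```

A real design has millions of gates and the delay of each one depends on load, temperature, and voltage, which is where the hundreds-of-pages rule decks and hour-long file loads come from.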
