
Our interconnect between chips is also deterministic! You can read more about our interconnect, synchronisation, and error correction in our paper.

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...




You can find out about the chip to chip interconnect from our paper below, section 2.3. I don't think that's custom.

We achieve low latency essentially by being a software-defined architecture. Our functional units operate completely orthogonally to each other. We don't have to batch in order to achieve parallelism, and the system behaviour is completely deterministic, so we can schedule all operations precisely.
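A toy sketch of what "deterministic, statically scheduled" means in practice (this is an illustrative model, not Groq's actual compiler; the op names and latencies are made up). Because every op's latency is fixed and there are no caches or dynamic arbitration, the "compiler" can assign every op a start cycle up front, and end-to-end latency is known exactly before anything runs:

```python
# Hypothetical fixed latencies per op type, in cycles.
OP_LATENCY = {"load": 2, "matmul": 4, "add": 1, "store": 2}

def schedule(ops):
    """Assign each op a fixed start cycle from its dependencies.

    `ops` is a list of (name, kind, deps) in topological order.
    With deterministic latencies, this schedule is exact, not a guess.
    """
    kinds = {name: kind for name, kind, _ in ops}
    start = {}
    for name, kind, deps in ops:
        # An op starts as soon as its slowest dependency finishes.
        start[name] = max((start[d] + OP_LATENCY[kinds[d]] for d in deps),
                          default=0)
    return start

def total_cycles(ops, start):
    """Exact end-to-end latency, computable at 'compile' time."""
    kinds = {name: kind for name, kind, _ in ops}
    return max(start[n] + OP_LATENCY[kinds[n]] for n in start)

program = [
    ("a", "load",   []),
    ("b", "load",   []),
    ("c", "matmul", ["a", "b"]),
    ("d", "add",    ["c"]),
    ("e", "store",  ["d"]),
]
start = schedule(program)
print(start)                            # fixed start cycle for every op
print(total_cycles(program, start))     # known before execution
```

The same property is what lets them claim exact performance prediction at compile time: the predicted cycle count is the cycle count.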

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...


The Groq AI chip startup has solved this problem. They don't use hand-written kernels at all; instead they use a compiler, and they hold the top speed in the world on LLaMA2-70B: 240 tokens/s.

https://www.youtube.com/@GroqInc/videos

Other interesting Groq tidbits: their models are deterministic, the whole system of up to thousands of chips runs in sync on the same clock, and memory access and networking are directly controlled without caches or intermediaries, so they also run deterministically.

That speeds up communication and allows automatic synchronisation across thousands of chips running as one single large chip. The compiler does all the orchestration and optimisation. They can predict the exact performance of an architecture at compile time.

What makes Groq different is that they started from the compiler, and only later designed the hardware.


I recall reading, some decades ago, about dividing chips into separate domains that would communicate asynchronously, precisely to shorten the longest path and increase the max frequency.

It's my understanding modern processors do indeed do this to varying degrees.


The premise of simplifying the architecture, the focus on memory, the reliance on software, even the fact that you can stack a ton of chips per node: it all sounds very much like Groq. I wonder if this is another case of multiple discovery.

It's really extraordinary how tightly coupled modern innovation in scientific fields is to processor implementations. I suspect you and I share a keen interest in the path by which we got to this enviable situation.

The most interesting thing about the FMxxxx chips was the pipeline was async.

It's an impressive feat of scaling by the engineers at Intel et al. :)

It's always fascinating to see how complex modern hardware actually is. Dozens of small processors with their own operating systems communicating with each other. I have a lot of respect for the engineers who actually managed to build a device so small yet so complex.

And the cooperation between the CPU and the chipset. In many cases it is even a trade secret how data is striped across the different slots/banks.

IIRC, you can optimise the locations of sub-components to minimise on-chip communication delays. The automatic algorithms are pretty good, but can often be bettered (particularly if critical paths are identified and optimised).
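To make the idea concrete, here's a toy placement optimizer (purely illustrative; the netlist is made up, and real place-and-route tools use simulated annealing or analytic solvers rather than brute force). It searches for the arrangement of four connected blocks on a 2x2 grid that minimises total Manhattan wirelength:

```python
import itertools

# Hypothetical netlist: pairs of blocks that must communicate.
NETS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
SLOTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 grid positions

def wirelength(placement):
    """Total Manhattan distance over all nets, a stand-in for delay."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1])
               for u, v in NETS)

def best_placement(blocks):
    """Exhaustive search; feasible only for tiny designs."""
    best = None
    for perm in itertools.permutations(SLOTS):
        p = dict(zip(blocks, perm))
        if best is None or wirelength(p) < wirelength(best):
            best = p
    return best

p = best_placement(["a", "b", "c", "d"])
print(wirelength(p))  # the ring a-b-c-d fits the grid with unit-length nets
```

Identifying and weighting critical nets (rather than treating all nets equally, as above) is exactly where hand optimisation can still beat the automatic tools.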

> You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

This is a very cool comment in general, but I'm intrigued by this bit in particular. I'd love to see an implementation, if anyone knows of any.
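I haven't seen a published implementation either, but the core idea can be sketched (everything below is hypothetical, including the latency numbers; a real harness would measure cache-line ping-pong times between pinned threads). If core-to-core latency scales with physical distance, you can fit a measured latency matrix against candidate grid layouts and keep the one that explains it best:

```python
import itertools

# Hypothetical "measured" pairwise latencies between 4 cores (arbitrary units),
# standing in for real cache-line ping-pong measurements.
LAT = {
    frozenset({0, 1}): 10, frozenset({0, 2}): 10, frozenset({0, 3}): 20,
    frozenset({1, 2}): 20, frozenset({1, 3}): 10, frozenset({2, 3}): 10,
}
SLOTS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # candidate 2x2 physical grid

def fit_error(layout):
    """How well does (Manhattan distance * 10) explain the latencies?"""
    err = 0
    for pair, lat in LAT.items():
        a, b = sorted(pair)
        d = abs(layout[a][0] - layout[b][0]) + abs(layout[a][1] - layout[b][1])
        err += abs(d * 10 - lat)
    return err

# Try every assignment of cores to grid slots; keep the best fit.
best = min((dict(zip(range(4), perm))
            for perm in itertools.permutations(SLOTS)),
           key=fit_error)
print(fit_error(best))  # 0 means some layout perfectly explains the data
```

Real measurements are noisy, hence the parent's "requires low noise"; averaging many round trips and using a least-squares fit instead of exact matching would be the obvious refinements.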


There are roughly equal levels of complexity in any modern, high-performance, general-purpose/programmable chip.

That is true. Scaling this to a processor with caches and deep pipelines is completely non-trivial. There have been some attempts to scale to more "realistic" designs with compositional (stepwise) verification.

http://www.ccs.neu.edu/home/pete/research/ieee-vlsi-composit...

http://www.ccs.neu.edu/home/pete/research/todaes-safety-live...


Remember that this work applies generically to Mali chips using the Bifrost architecture.

Isn't this where Mr Moore's (of Forth fame) CPU designs have gone? (See https://www.greenarraychips.com.)

OK, maybe not the 'actor' model per se (unless actor channels are thought of as core interconnects). I think there is also a global state, but I can't remember much of these details at the moment without re-watching the linked Strange Loop vid.

https://www.youtube.com/watch?v=0PclgBd6_Zs&t=2s or https://news.ycombinator.com/item?id=22021262 is probably the best for grokking it.


Yeah, it's an interesting design challenge. It seems like "design a SoC that performs exactly like discrete chips connected by a bus" is a task that'd have a lot of sneaky pitfalls. Partly depends on how close "exactly" has to be.

As a software person, I'm most impressed by CPU hardware -- all the crazy speculative out-of-order dynamic translation stuff that goes on at GHz timescales. It's astonishing that it works at all, let alone that it's almost 100% reliable.

The hope is that a standard will be created soon to support this behaviour. These sorts of things are complicated because they require the support of the chipset vendors, i.e. Intel and AMD.

I'm working (as a consultant) for a company that does state-of-the-art chip design. The complexity is mind-boggling. The number of different types of core components isn't very high; at root it's all transistors and (nominally) rectangular pieces of metal. But these simple building blocks interact with each other in ridiculously complicated ways. Just processing the files that contain the timing information for the current generation of chip fabrication technology takes over an hour on a fully-tricked-out server. Checking a design to see if it meets a timing spec takes many hours using dozens of servers. And then you have to deal with power, thermal constraints, geometric design rule checks (the list of rules you have to follow is a PDF document hundreds of pages long), clock distribution... Frankly, it amazes me that state-of-the-art chips work at all.
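The timing-signoff step mentioned above can be illustrated with a drastically simplified static timing analysis (the gate types, delays, and netlist here are invented; real STA also models wire delay, slew, setup/hold margins, and multiple process corners, which is why it takes dozens of servers many hours):

```python
# Made-up per-gate delays in nanoseconds.
GATE_DELAY = {"and": 2.0, "or": 2.0, "xor": 3.0, "buf": 1.0}

# Hypothetical netlist in topological order: (gate, type, fan-in signals).
NETLIST = [
    ("g1", "and", ["in_a", "in_b"]),
    ("g2", "xor", ["in_b", "in_c"]),
    ("g3", "or",  ["g1", "g2"]),
    ("g4", "buf", ["g3"]),
]

def worst_arrival(netlist):
    """Latest signal arrival time at any gate output (the critical path)."""
    arrival = {}
    for name, kind, fanin in netlist:
        # Primary inputs (not in `arrival`) are assumed valid at t = 0.
        t_in = max(arrival.get(f, 0.0) for f in fanin)
        arrival[name] = t_in + GATE_DELAY[kind]
    return max(arrival.values())

CLOCK_PERIOD = 10.0  # ns
t = worst_arrival(NETLIST)
print(t, "meets timing" if t <= CLOCK_PERIOD else "violates timing")
```

A real design has millions of gates and the delay of each one depends on load, temperature, and voltage, which is where the hundreds-of-pages rule decks and hour-long file loads come from.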
