KAIST develops next-generation ultra-low power LLM accelerator (en.yna.co.kr)
3 points by readline_prompt | 2024-03-06 | 41 comments




> The 4.5-mm-square chip, developed using Korean tech giant Samsung Electronics Co.'s 28 nanometer process, has 625 times less power consumption compared with global AI chip giant Nvidia's A-100 GPU, which requires 250 watts of power to process LLMs, the ministry explained.

>processes GPT-2 with an ultra-low power consumption of 400 milliwatts and a high speed of 0.4 seconds

Not sure what the point of comparing the two is; an A100 will get you a lot more speed than 2.5 tokens/sec. GPT-2 is just a 1.5B-param model; a Pi 4 would get you more tokens per second with CPU inference alone.
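
Spelling that out with just the figures quoted above (reading "a high speed of 0.4 seconds" as 0.4 s per generated token, which is an assumption on my part):

    # Back-of-envelope from the article's figures only.
    # Assumes "0.4 seconds" means 0.4 s per generated token.
    kaist_power_w = 0.4        # 400 mW
    kaist_s_per_token = 0.4    # assumed reading of "0.4 seconds"
    a100_power_w = 250.0

    tokens_per_s = 1 / kaist_s_per_token                     # 2.5 tokens/s
    energy_per_token_j = kaist_power_w * kaist_s_per_token   # 0.16 J per token
    power_ratio = a100_power_w / kaist_power_w               # 625x, matching the article

    print(f"{tokens_per_s:.1f} tokens/s, {energy_per_token_j:.2f} J/token, {power_ratio:.0f}x power ratio")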

Still, I'm sure there are improvements to be made, and the direction is fantastic to see, especially after Coral TPUs have proven completely useless for LLM and Whisper acceleration. Hopefully it ends up as something vaguely affordable.


Which of the model requirements of Coral TPUs [1] are the most problematic for LLMs?

[1] https://coral.ai/docs/edgetpu/models-intro/#model-requiremen...


Guessing as to what the GP meant: Coral TPUs max out around 8M parameters, IIRC. That's a few orders of magnitude less than the smallest LLM.

The part where they have, like, 3 bytes of memory, so you switch from the already-high latency of RAM to the laughably sluggish latency of USB serial. I think there's also no support below 8-bit quants, which you'd really need.
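
To make the size mismatch concrete, a quick fit check; the ~8 MB on-chip memory figure for the Edge TPU is my assumption based on Coral's published specs, and the rest comes from this thread:

    # Rough fit check: GPT-2 XL vs. Coral Edge TPU on-chip memory.
    edge_tpu_sram_bytes = 8 * 1024**2      # assumed ~8 MB of on-chip memory
    gpt2_params = 1.5e9                    # 1.5B parameters, per the thread
    bytes_per_param_int8 = 1               # Edge TPU expects int8-quantized weights

    model_bytes = gpt2_params * bytes_per_param_int8
    print(f"~{model_bytes/1e9:.1f} GB of weights vs ~{edge_tpu_sram_bytes/1e6:.0f} MB on chip "
          f"({model_bytes/edge_tpu_sram_bytes:.0f}x too big), so weights stream over USB/PCIe")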

> New structure mimics the layout of neurons and synapses

What does that mean, practically?

How can you mimic that layout in silicon?


This means they use Spiking Neural Networks. It’s a software algorithm that most likely doesn’t work as well as regular NNs.

Well, our brains are closer to spiking neural networks than 'regular' neural networks. And they work pretty well. For the most part.

I feel like SNNs are like Brazil: they are the future, and shall remain so. I think more basic research is needed for them to mature. AFAIK the current SOTA is to train them with 'surrogate gradients', which shoehorn them into the current NN training paradigm, and that sort of discards some of their worth. Have biologically inspired learning rules, like STDP, _really_ been exhausted?


But this group claims to have demonstrated a way to use SNNs to run LLMs effectively and with vastly less energy usage.

If OpenAI or DeepMind made such a claim, I'd pay attention. Otherwise it's always some (usually hardware) people trying to get a grant, or even just to publish a paper.

p.s. People interested in biologically inspired data processing algorithms should look at Numenta's papers (earlier ones, because recently they switched to regular deep learning), and especially learn their justification for not using spikes.


They switched? Why? There goes their whole raison d'etre!

Spiking neural networks are not software; they are usually built directly into silicon because they use pulse timing to encode information instead of multi-bit values. The problem is that training them is difficult because they operate over time, not that they don't work. As of now, scaling training infrastructure is more important than theoretical power efficiency.

Neuromorphic computing basically uses individual "neurons", represented with either analog or digital circuits, which communicate using asynchronous pulses called "spikes". Unlike the human brain, neuromorphic chips are 2D, but we can replicate a good amount of neural dynamics in silicon.

It's unclear how they managed to use this to run LLMs, though. Getting GPT-2 running with SNNs is a legitimate achievement, because SNNs have traditionally lagged significantly behind conventional deep learning architectures.

https://web.stanford.edu/group/brainsinsilicon/documents/ANe... https://web.stanford.edu/group/brainsinsilicon/documents/Ben...
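
For anyone unfamiliar with spiking models, here is a minimal leaky integrate-and-fire neuron in Python. It's the generic textbook formulation with illustrative constants, not the KAIST chip's actual design:

    import numpy as np

    # Minimal leaky integrate-and-fire (LIF) neuron: the membrane potential leaks
    # toward rest, integrates input current, and emits a spike when it crosses a
    # threshold. Generic textbook model, illustrative constants only.
    def lif_simulate(input_current, dt=1e-3, tau=20e-3,
                     v_rest=0.0, v_thresh=1.0, v_reset=0.0):
        v = v_rest
        spikes = []
        for i_t in input_current:
            v += (-(v - v_rest) + i_t) * (dt / tau)   # leak + integrate
            if v >= v_thresh:                         # fire and reset
                spikes.append(1)
                v = v_reset
            else:
                spikes.append(0)
        return np.array(spikes)

    # A constant drive yields a regular spike train; information is carried in spike timing.
    spikes = lif_simulate(np.full(200, 1.5))
    print("spikes in 200 steps:", spikes.sum())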


I want to reference Groq.com. They are developing their own inference hardware called an LPU: https://wow.groq.com/lpu-inference-engine/

They also released their API a week or two ago. It's significantly faster than anything from OpenAI right now. Mixtral 8x7B runs at around 500 tokens per second. https://groq.com/


It's not so much an accelerator as it is addressing the main inference bottleneck (i.e. memory latency) with sheer brute force by throwing money at the problem. They've made accelerators out of pure L3 cache with a whopping 230 MB per card. They cited something like 500 cards to load one single Mixtral instance, which probably cost over $10M to build. It's a supercomputer essentially.
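
A rough sanity check on that card count; the ~47B total parameter count for Mixtral 8x7B and the 16-bit weight assumption are mine, not from the thread:

    # How many 230 MB cards does it take just to hold Mixtral's weights?
    sram_per_card_bytes = 230e6          # 230 MB per card, per the parent comment
    mixtral_params = 46.7e9              # assumed total parameter count for Mixtral 8x7B
    bytes_per_param = 2                  # assumed fp16/bf16 weights

    weights_bytes = mixtral_params * bytes_per_param
    cards_needed = weights_bytes / sram_per_card_bytes
    print(f"~{cards_needed:.0f} cards just for the weights")   # ~406, same ballpark as "500"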

Or to put it another way: they’ve made a compute substrate with the correct ratios of processing power to memory capacity.

NVIDIA GPUs were optimised for different workloads, such as 3D rendering, that have different optimal ratios.

This “supercomputer” isn’t brute force or wasteful, because it allows more requests per second: by processing each response faster, it can pipeline more of them through per unit of time and silicon area.


The correct ratio for one workload (production inference).

A recent presentation on the architecture

https://youtu.be/WQDMKTEgQnY?si=W0E9Kq6P280l3Wcl

IMO we still need an MLPerf submission or something similar to really understand whether this is more efficient overall, or more efficient only if you also want to minimize latency.

Nvidia has pulled enough rabbits out of the hat when it comes to MLPerf that I’m still not convinced they can’t work some CUDA magic and undercut them on efficiency.


They need 568 LPUs to load both Mixtral 8x7B and LLaMA 70B, because they need both those models available for the demo.

I imagine Mixtral by itself would only take something like 200-300 LPUs


Only $5M then.

$5M once, upfront. But given the significantly increased throughput, how fast does that pay for itself?

Depends on power usage. I’m curious how power hungry those are compared to server/workstation cards.

You need computers for all of them and megawatts of power, power supplies, cooling, and power distribution.

Naturally, but you need that for GPUs as well, no? What is the actual difference when running, when measured per token generated?

I'm pretty sure $20,000 per LPU isn't actually the cost of these LPUs. I saw someone else on HN asking if $20,000 could get them something and an employee said to reach out. Which makes me think $20,000 is enough to get some sort of model running at least, even if it's not necessarily an LLM.

Nothing wrong with that, though.

> pure L3 cache with a whopping 230 MB per card

Just to put these numbers in perspective: a desktop 8-core 7800X3D has 96 MB of L3 cache, and the top-end 96-core Epyc 9684X has 1.15 GB of L3.


What's the cost per inference relative to H100? Isn't that the number to care about?

If you believe the marketing material it’s lower. Their API is the cheapest around, so either it’s true or they’re subsidizing.

Another consideration: Even if it's slightly more expensive, that can be OK if you care about inference speed. I'd pay 50% more for GPT-4 if it could deliver results that quick.

Based on some rough, conservative ballpark estimates (one server with two A100s at $50,000; 50 tokens/s on one of those servers; so 10 of those servers), the upfront cost with consumer hardware seems to be 1/10 to 1/20 of what the Groq hardware costs. I would guess that realistically cloud providers can probably achieve half to a third of that price.

So unless you need Groq's low latency, consumer hardware seems to be a lot cheaper for the same throughput.
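
Spelling out that ballpark with the thread's own numbers (the $5M-$10M Groq-side totals come from the earlier comments above, and 500 tokens/s is the Mixtral throughput quoted for Groq's demo):

    # Back-of-envelope cost comparison using the parent's figures.
    a100_server_cost = 50_000     # one server with 2x A100, per the parent
    a100_server_tps = 50          # tokens/s per such server, per the parent
    target_tps = 500              # roughly Groq's demoed Mixtral throughput

    servers = target_tps / a100_server_tps          # 10 servers
    a100_total = servers * a100_server_cost         # $500k
    groq_low, groq_high = 5e6, 10e6                 # "$5M"/"$10M" estimates from earlier comments

    print(f"A100 route: ~${a100_total/1e6:.1f}M vs Groq: ${groq_low/1e6:.0f}M-${groq_high/1e6:.0f}M "
          f"(roughly 1/{groq_low/a100_total:.0f} to 1/{groq_high/a100_total:.0f} the upfront cost)")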


Grayskull has 96 MB SRAM and people call it overpriced at $600 to $800. It is far more plausible that their chip costs are somewhere around $500.

Neuromorphic computing is cool, but not new tech. However, using a neuromorphic spiking architecture to run LLMs seems new. Unfortunately, there doesn't seem to be a paper associated with this work, so there's no deeper information on what exactly they're doing.

I heavily doubt that they are running LLMs on this.

The article says they ran GPT-2! Which isn't particularly large, but replicating a large language model with a spiking neural network seems like novel work at least.

Is their work similar to Rain.ai then?

The article says 400 milliwatt power draw.

Wolfram Alpha says that's roughly equivalent to a cell phone's power draw while sleeping.


We build software acceleration for LLMs, effectively running smaller Llama 2 models on several L4s at the same performance as on a single A100.

Quick shoutout to https://youtube.com/@TechTechPotato for those interested in keeping tabs on the AI hardware space. There is much more going on in this area than you would think if you only follow general media.
