Yep, but running inference on it at any reasonable speed requires you to have all of it in GPU RAM - i.e. you need a cluster of ~100 high-performance GPUs.
I have not tried, but 96GB of GPU memory is plenty; for inference there should certainly be no issue. Their biggest model has 13B parameters, so you should already be able to run inference (float16) with 32GB of memory.
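Back-of-the-envelope for the weight footprint (ignoring activations, the KV cache and framework overhead, which add a few more GB):

```python
# Rough VRAM needed just to hold the weights, per dtype.
# Ignores activations, KV cache and framework overhead (assume a few extra GB).
PARAMS = 13e9  # 13B-parameter model

bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{dtype:>7}: ~{gib:.0f} GiB of weights")
# float16 comes out to ~24 GiB, which is why 32GB is enough for the weights alone.
```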
With 96GB of memory you should also be able to fine-tune it (you may need tricks like gradient accumulation and/or gradient checkpointing), but you have to be ready for many days of computation...
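For reference, gradient accumulation just splits a large effective batch into micro-batches and calls backward() several times before each optimizer step. A minimal PyTorch sketch with toy stand-ins for the model, data and optimizer (a real run would use the 13B model):

```python
import torch
from torch import nn

ACCUM_STEPS = 8  # effective batch size = micro-batch size * ACCUM_STEPS

# Toy stand-ins so the loop is runnable on its own.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average correctly
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # one real step per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```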
Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from 2 3090s, for example? I'm genuinely curious because I haven't run across them and it sounds interesting.
I think in the case of LLM inference the main bottleneck is streaming the weights from VRAM to CU/SM/EU (whatever naming your GPU vendor of choice uses).
If you're doing inference on multiple prompts at the same time by batching, the weight streaming doesn't take any more time, but each streamed weight gets used for, say, 32 calculations instead of 1, making much better use of the GPU's compute resources.
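A toy back-of-the-envelope illustration of that reuse (single fp16 weight matrix, ignoring activations and the KV cache; the sizes are arbitrary):

```python
# Toy arithmetic-intensity estimate for one matmul y = x @ W
# with a (d_in x d_out) fp16 weight matrix.
d_in, d_out = 4096, 4096
bytes_streamed = d_in * d_out * 2          # weights read from VRAM once per pass

for batch in (1, 32):
    flops = 2 * batch * d_in * d_out       # multiply-adds for the whole batch
    intensity = flops / bytes_streamed     # FLOPs per byte of weights streamed
    print(f"batch={batch:>2}: ~{intensity:.0f} FLOPs per byte of weights")
# batch=1 : ~1 FLOP/byte  -> memory-bandwidth bound
# batch=32: ~32 FLOPs/byte -> far better use of the compute units
```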
A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference will not be a GPU-exclusive activity going forward. Given that RAM is the main bottleneck at this point, running on the CPU seems more practical for most end users.
For decent on-device inference, you need three things: enough memory, high-enough memory bandwidth, and enough processing power connected to that memory.
The traditional PC architecture is ill-suited to this, which is why for most computers a GPU (which offers all three in a single package) is currently the best approach... though at the moment even a 4090 only offers enough memory to load and run moderately sized models.
The architecture behind Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the size of model that can be loaded, some higher-end options have decently high memory bandwidth, and the architecture lets any of the different processing units access the unified memory. Their weaknesses are cost, and that the processors aren't yet powerful enough to compete at the top end.
So there's currently an empty middle-ground to be won, and it's interesting to watch and wait to see how it will be won. e.g.
- affordable GPUs with much larger memory for the big models? (i.e. GPUs, but optimised)
- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)
- something else (probably on the software side) making larger model inference more efficient, either from a processing side (i.e. faster for the same output) or from a memory utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, as in the sketch after this list, or smaller models for the same output)
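That last streaming idea, very roughly: keep the weights on the CPU (or disk) and move one block at a time onto the GPU, trading a lot of PCIe traffic for a much smaller peak memory footprint. A hypothetical PyTorch-flavoured sketch, where layers and hidden are placeholders:

```python
from torch import nn

def streamed_forward(layers, hidden, device="cuda"):
    """Run a forward pass while keeping only one block on the GPU at a time.

    layers: list of nn.Module blocks whose weights live on the CPU (or are
    loaded lazily from disk); hidden: the input activations, already on device.
    """
    for layer in layers:
        layer.to(device)        # copy this block's weights to the GPU
        hidden = layer(hidden)  # run it
        layer.to("cpu")         # evict it to make room for the next block
    return hidden
```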
Agreed that there are workloads where inference is not expensive, but it's really workload-dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.
Yes, if you can fit it in RAM then the loss isn't too bad, provided you have the memory pinned and are taking advantage of some of Nvidia's optimizations (which should be on by default). Inference usually isn't too bad, but training is heavier. Things have also massively improved over the last few years. Efficiency also really increases with reduced-precision floating point types and quantized models.
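For context, the pinned-memory part looks roughly like this in PyTorch (just a sketch; the tensor shape is an arbitrary example):

```python
import torch

# Pinned (page-locked) host memory lets the host->GPU copy run asynchronously,
# so with a bit of stream management it can overlap with compute.
batch = torch.randn(32, 3, 224, 224).pin_memory()  # page-locked host tensor
gpu_batch = batch.to("cuda", non_blocking=True)    # asynchronous H2D copy
# ... independent kernels on another stream could run while the copy is in flight ...
torch.cuda.synchronize()                           # make sure the copy has finished
```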