Yep, but running inference on it at any reasonable speed requires you to have all of it in GPU RAM - i.e. you need a cluster of ~100 high-performance GPUs.
I have not tried, but 96GB of GPU memory is plenty; for inference there should certainly be no issue. Their biggest model has 13B parameters, so you should already be able to run inference (float16) with 32GB of memory.
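Back-of-the-envelope for the weight footprint (ignoring activations, the KV cache and framework overhead, which add a few more GB):

```python
# Rough VRAM needed just to hold the weights, per dtype.
# Ignores activations, KV cache and framework overhead (assume a few extra GB).
PARAMS = 13e9  # 13B-parameter model

bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{dtype:>7}: ~{gib:.0f} GiB of weights")
# float16 comes out to ~24 GiB, which is why 32GB is enough for the weights alone.
```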
With 96GB of memory you should also be able to fine-tune it (you may need tricks like gradient accumulation and/or gradient checkpointing), but you have to be ready for many days of computation...
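For reference, gradient accumulation just splits a large effective batch into micro-batches and calls backward() several times before each optimizer step. A minimal PyTorch sketch with toy stand-ins for the model, data and optimizer (a real run would use the 13B model):

```python
import torch
from torch import nn

ACCUM_STEPS = 8  # effective batch size = micro-batch size * ACCUM_STEPS

# Toy stand-ins so the loop is runnable on its own.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average correctly
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # one real step per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```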
Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from 2 3090s, for example? I'm genuinely curious because I haven't run across them and it sounds interesting.
I think in the case of LLM inference the main bottleneck is streaming the weights from VRAM to CU/SM/EU (whatever naming your GPU vendor of choice uses).
If you're doing inference on multiple prompts at the same time by batching, the weight streaming doesn't take any more time, but each streamed weight gets used for, say, 32 calculations instead of 1, making much better use of the GPU's compute resources.
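A toy back-of-the-envelope illustration of that reuse (single fp16 weight matrix, ignoring activations and the KV cache; the sizes are arbitrary):

```python
# Toy arithmetic-intensity estimate for one matmul y = x @ W
# with a (d_in x d_out) fp16 weight matrix.
d_in, d_out = 4096, 4096
bytes_streamed = d_in * d_out * 2          # weights read from VRAM once per pass

for batch in (1, 32):
    flops = 2 * batch * d_in * d_out       # multiply-adds for the whole batch
    intensity = flops / bytes_streamed     # FLOPs per byte of weights streamed
    print(f"batch={batch:>2}: ~{intensity:.0f} FLOPs per byte of weights")
# batch=1 : ~1 FLOP/byte  -> memory-bandwidth bound
# batch=32: ~32 FLOPs/byte -> far better use of the compute units
```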
A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference will not be a GPU-exclusive activity going forward. Given that RAM is the main bottleneck at this point, running on the CPU seems more practical for most end users.
For decent on-device inference, you need three things: enough memory, high-enough memory bandwidth, and enough processing power connected to that memory.
The traditional PC architecture is ill-suited to this, which is why for most computers a GPU (which offers all three in a single package) is currently the best approach... though at the moment even a 4090 only offers enough memory to load and run moderately sized models.
The architecture behind Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the size of model that can be loaded, some higher-end options have decently high memory bandwidth, and the architecture lets any of the different processing units access the unified memory. Their weaknesses are cost, and that the processors aren't yet powerful enough to compete at the top end.
So there's currently an empty middle-ground to be won, and it's interesting to watch and wait to see how it will be won. e.g.
- affordable GPUs with much larger memory for the big models? (i.e. GPUs, but optimised)
- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)
- something else (probably on the software side) making larger model inference more efficient, either from a processing side (i.e. faster for the same output) or from a memory utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, as in the sketch after this list, or smaller models for the same output)
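That last streaming idea, very roughly: keep the weights on the CPU (or disk) and move one block at a time onto the GPU, trading a lot of PCIe traffic for a much smaller peak memory footprint. A hypothetical PyTorch-flavoured sketch, where layers and hidden are placeholders:

```python
from torch import nn

def streamed_forward(layers, hidden, device="cuda"):
    """Run a forward pass while keeping only one block on the GPU at a time.

    layers: list of nn.Module blocks whose weights live on the CPU (or are
    loaded lazily from disk); hidden: the input activations, already on device.
    """
    for layer in layers:
        layer.to(device)        # copy this block's weights to the GPU
        hidden = layer(hidden)  # run it
        layer.to("cpu")         # evict it to make room for the next block
    return hidden
```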
Agreed that there are workloads where inference is not expensive, but it's really workload-dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.
Yes, if you can fit it in RAM then the loss isn't too bad, provided you have the memory pinned and are taking advantage of some of Nvidia's optimizations (which should be on by default). Inference usually isn't too bad, but training is heavier. Things have also massively improved over the last few years. Efficiency also really increases with reduced-precision floating point types and quantized models.
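For context, the pinned-memory part looks roughly like this in PyTorch (just a sketch; the tensor shape is an arbitrary example):

```python
import torch

# Pinned (page-locked) host memory lets the host->GPU copy run asynchronously,
# so with a bit of stream management it can overlap with compute.
batch = torch.randn(32, 3, 224, 224).pin_memory()  # page-locked host tensor
gpu_batch = batch.to("cuda", non_blocking=True)    # asynchronous H2D copy
# ... independent kernels on another stream could run while the copy is in flight ...
torch.cuda.synchronize()                           # make sure the copy has finished
```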