I'm an ML engineer, but I know nothing about the inference side. Are there really so many kinds of devices that optimizing inference for a particular device is a thing? I thought almost everyone serves from GPUs/TPUs, so there are only two major device types. What am I missing here?
It's true that inference is still very often done on CPU, or even on microcontrollers. In our view, this is in large part because many applications lack good options for inference accelerator hardware. This is what we aim to change!
Are people largely running inference on GPU/TPU? At my job, we run inference with pretty large transformers on CPU just because of how much cheaper it is.
Just to add to this, the reason these inference accelerators have become big recently (see also the "neural core" in Pixel phones) is that they let you do inference tasks in real time (lower model latency) with better power usage than a GPU.
As a concrete example, on a camera you might want to run a facial detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can detect the outline of the person in the shot, so that you can blur or change their background in something like a Zoom call. All of these applications are going to work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
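To put rough numbers on those frame rates (just arithmetic, nothing model-specific):

    # Per-frame latency budget at the frame rates mentioned above.
    for fps in (20, 60):
        budget_ms = 1000.0 / fps
        print(f"{fps} fps -> model + pre/post-processing must finish in ~{budget_ms:.1f} ms")
    # 20 fps leaves ~50 ms per frame, 60 fps only ~16.7 ms,
    # so tripling the frame rate cuts the per-frame compute budget to a third.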
Yep, but running inference on it at any reasonable performance requires you to have all of it in GPU RAM, i.e. you need a cluster of ~100 high-performance GPUs.
For decent on-device inference, you need enough memory, at high enough bandwidth, with enough processing power connected to that memory.
The traditional PC architecture is ill-suited to this, which is why, for most computers, a GPU (which offers all three in a single package) is currently the best approach... except that, at the moment, even a 4090 only offers enough memory to load and run moderately-sized models (see the rough sizing sketch after the list below).
The architecture that supports Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the model that can be loaded, some higher-end options have decently-high memory bandwidth, and the architecture lets any of the different processing units access the unified memory. Their weakness is cost, and that the processors aren't powerful enough yet to compete at the top end.
So there's currently an empty middle-ground to be won, and it's interesting to watch and wait to see how it will be won. e.g.
- affordable GPUs with much larger memory for the big models? (i.e. GPUs, but optimised)
- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)
- something else (probably on the software side) making larger model inference more efficient, either from a processing side (i.e. faster for the same output) or from a memory utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, or smaller models for the same output)
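To make the memory point concrete, here's a rough sizing sketch in Python; the parameter counts are purely illustrative, and it ignores activations, KV caches and framework overhead, which all add more:

    # Rough sizing: memory needed just to hold the weights, by precision.
    BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

    def weight_memory_gb(params_billion, dtype):
        return params_billion * BYTES_PER_PARAM[dtype]  # 1e9 params * bytes / 1e9 = GB

    for n in (7, 13, 70):  # illustrative model sizes, in billions of parameters
        line = ", ".join(f"{dtype}: ~{weight_memory_gb(n, dtype):.0f} GB"
                         for dtype in BYTES_PER_PARAM)
        print(f"{n}B params -> {line}")
    # A 24 GB card fits 7B comfortably at fp16, 13B only when quantised,
    # and 70B not at all -- which is the gap the unified-memory machines target.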
This also very much depends on the inference use case / context. For example, I work in deep learning on digital pathology, where images can be up to 100,000 x 100,000 pixels in size and inference needs GPUs, as it's just way too slow otherwise.
ML inference is not magic: by and large, it's just a combination of simple operations like matrix multiplications/dot products, element-wise nonlinearities, convolutions and other stuff that vector processors, GPUs and increasingly CPUs (thanks to SIMD) are very well optimized for. (In theory one could optimize a chip for some specific, well-defined ML architecture, even to the point of "wiring" the architecture in hardware, and people used to do such things back in the 1980s when this was needed in order to even experiment with e.g. neural network models. But given how fast ML is progressing these days, there's just no reason for doing anything like that nowadays!)
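To illustrate, a toy forward pass really is just matmuls plus element-wise nonlinearities; a minimal NumPy sketch with random weights (not any real model):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def mlp_forward(x, layers):
        """One inference pass of a tiny MLP: alternate matmul and nonlinearity."""
        h = x
        for i, (W, b) in enumerate(layers):
            h = h @ W + b
            if i < len(layers) - 1:      # no nonlinearity after the final layer
                h = relu(h)
        return h

    rng = np.random.default_rng(0)
    sizes = [128, 256, 256, 10]          # arbitrary toy dimensions
    layers = [(rng.standard_normal((m, n)) * 0.02, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    logits = mlp_forward(rng.standard_normal((32, 128)), layers)  # batch of 32
    print(logits.shape)                  # (32, 10)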
But can't you configure the device to do e.g. fast matrix-vector multiplications instead of inference? I could be wrong, but I suspect that's mostly what people do on supercomputers anyway.
To run inference on GPUs, people are typically using TensorRT or a similarly-optimized engine. That can make a big difference in cost tradeoffs vs. CPU. Ultimately, if you can keep a GPU reasonably well-fed, the GPU can come out much cheaper and lower latency. If your workload is very sporadic and infrequent, YMMV.
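For what it's worth, one low-effort way to try a TensorRT-backed path (when the TensorRT build of ONNX Runtime is installed) is via execution providers; a minimal sketch, with the model file and input shape as placeholders:

    import numpy as np
    import onnxruntime as ort

    # Prefer TensorRT, fall back to plain CUDA, then CPU, depending on availability.
    providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder model file

    input_name = sess.get_inputs()[0].name
    batch = np.random.rand(8, 3, 224, 224).astype(np.float32)       # placeholder shape
    outputs = sess.run(None, {input_name: batch})
    print(outputs[0].shape)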
Inference is mostly just matrix multiplications, so there are plenty of competitors.
Problem is, inference costs do not dominate training costs. Models have a very limited lifespan; they are constantly retrained or made obsolete by new generations, so training is always going on.
Training is not just matrix multiplications, and given the hundreds of experiments in model architecture, it's not even obvious what operations will dominate future training. So a more general-purpose GPU is just a way safer bet.
Also, LLM talent is in extremely short supply, and you don't want to piss them off by telling them they have to spend their time debugging some crappy FPGA because you wanted to save some hardware bucks.
For applications that aren't latency-sensitive, I run inference on a free 4-core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.
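For anyone curious what that looks like, here's a minimal sketch of throughput-oriented CPU inference pinned to 4 threads; the model file, shapes and batch size are placeholders, and it assumes an ONNX model:

    import numpy as np
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 4        # match the 4 cores on the box
    sess = ort.InferenceSession("model.onnx", sess_options=opts,
                                providers=["CPUExecutionProvider"])  # placeholder model

    input_name = sess.get_inputs()[0].name
    backlog = np.random.rand(1000, 3, 224, 224).astype(np.float32)   # placeholder data
    # Work through the backlog in batches; throughput matters here, latency doesn't.
    results = [sess.run(None, {input_name: backlog[i:i + 32]})[0]
               for i in range(0, len(backlog), 32)]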
Hmm, that's interesting. What kind of inference workload requires more than the 48 GB of memory you'd get from two 3090s, for example? I'm genuinely curious, because I haven't run across them and it sounds interesting.
Inference is also becoming a bigger contributor to compute bills, especially as models get bigger. With big models like GPT-2, it's not unheard of for teams to scale up to hundreds of GPU instances to handle a surprisingly small number of concurrent users. Things can get expensive pretty quickly.
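A quick back-of-envelope shows how fast the instance count climbs; every number below is made up purely for illustration:

    import math

    # Hypothetical traffic: ~2 s of GPU time per request, each active user
    # sending a request roughly every 10 s.
    gpu_seconds_per_request = 2.0
    requests_per_user_per_second = 1 / 10
    concurrent_users = 500

    demand = concurrent_users * requests_per_user_per_second * gpu_seconds_per_request
    # Each GPU supplies at most 1 GPU-second per second; keep headroom for spikes.
    utilization_target = 0.6
    print(math.ceil(demand / utilization_target))  # ~167 GPUs for this made-up profile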
Hey man, thanks for the article. I like that it is concise and simple. One thing that's not clear to me is the inference stage: where does this inference process run? Do you need to run it on a GPU-powered instance, or could it be run on a consumer laptop?