It seems like you are making general-purpose chips to run many models. Are we at a stage where we can consider taping out inference networks directly, with the weights baked in as constants in the RTL design?
Are chips and models obsoleted on roughly the same timelines?
We build out large systems where we stream the model weights into the system once and then run many inferences against them. We don't really recommend streaming model weights onto the chip repeatedly, because you'll lose the benefits of low latency.
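Roughly the pattern, as a toy Python sketch (the class and file names are made up for illustration, not our actual stack): the weights are loaded into the accelerator's memory once, and every subsequent request only moves activations.

    # Toy sketch of the "load weights once, reuse them" serving pattern.
    import numpy as np

    class ResidentModel:
        def __init__(self, weight_file: str):
            # Paid once: weights are streamed into the accelerator here.
            self.weights = np.load(weight_file)          # e.g. one dense layer
        def infer(self, x: np.ndarray) -> np.ndarray:
            # Paid per request: only activations move; the weights stay resident.
            return np.maximum(x @ self.weights, 0.0)

    # model = ResidentModel("layer0.npy")
    # for batch in request_stream:        # many inferences amortize the one-time load
    #     out = model.infer(batch)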
The "even for inference" thing has turned into a bit of a trap imo.
Data-parallel models scaled up for training and could then run on individual chips, but these massive model-parallel models require a couple of chips directly linked together even to do inference.
So the idea that a competitor could come in with a simple, cheap inference chip doesn't really work.
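To make the model-parallel point concrete, here's a toy NumPy sketch (shapes made up, not any real model): even a single layer's weight matrix gets sharded across devices, and the partial results have to be gathered over the chip-to-chip links before the next layer can run.

    # Why model-parallel inference needs linked chips: one layer's weight matrix
    # is split column-wise across "devices", each computes its slice, and the
    # partial outputs are gathered over the interconnect.
    import numpy as np

    d_model, d_ff, n_devices = 1024, 4096, 4
    x = np.random.randn(1, d_model).astype(np.float32)     # one token's activations
    W = np.random.randn(d_model, d_ff).astype(np.float32)  # pretend this is too big for one device

    shards = np.split(W, n_devices, axis=1)                # each "device" holds d_ff / n_devices columns
    partials = [x @ shard for shard in shards]             # local matmul on each device
    y = np.concatenate(partials, axis=1)                   # the "all-gather" over chip-to-chip links

    assert np.allclose(y, x @ W, atol=1e-3)                # same answer, but only with fast links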
We are already seeing chips for inference, really. It's how these models are getting into the consumer market. A lot of the big phones have an inference chip (Tensor, Neural Engine, and the like), TVs are getting them, and most GPUs have some hardware dedicated to inference (DLSS, super-resolution).
Do you know of any production Bayesian inference workload that really needs specialized chips?
The problem with NN inference / training is that it's eating up the datacenters.
At the same time, you can't achieve a 20x speedup over GPUs if you need 32-bit floats, because in that case the GPU's relative energy efficiency isn't actually that bad.
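Rough arithmetic with commonly cited ~45nm energy-per-operation figures (ballpark assumptions, not measurements from any particular chip, and ignoring data movement, which often dominates anyway):

    # Back-of-envelope on where the "20x" lives.
    fp32_mac_pj = 3.7 + 0.9    # 32-bit float multiply + add, picojoules (ballpark)
    int8_mac_pj = 0.2 + 0.03   # 8-bit integer multiply + add, picojoules (ballpark)

    print(f"fp32 MAC ~{fp32_mac_pj:.1f} pJ, int8 MAC ~{int8_mac_pj:.2f} pJ, "
          f"ratio ~{fp32_mac_pj / int8_mac_pj:.0f}x")
    # If the workload insists on fp32, that ~20x energy headroom over a GPU's
    # own fp32 units mostly disappears.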
It's true that inference is still very often done on CPU, or even on microcontrollers. In our view, this is in large part because many applications lack good options for inference accelerator hardware. This is what we aim to change!
ML inference is not magic: by and large, it's just a combination of simple operations, matrix multiplications/dot products, element-wise nonlinearities, convolutions and so on, that vector processors, GPUs and increasingly CPUs (thanks to SIMD) are very well optimized for. (In theory one could optimize a chip for some specific, well-defined ML architecture, even to the point of "wiring" the architecture into the hardware, and people did such things back in the 1980s when it was necessary just to experiment with e.g. neural network models. But given how fast ML is progressing, there's no reason to do anything like that today!)
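To illustrate just how plain the math is, here's a minimal NumPy sketch of a two-layer MLP forward pass (random weights and made-up shapes, purely for illustration):

    # Inference for a small MLP really is just matmuls plus element-wise ops.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((784, 256)), np.zeros(256)
    W2, b2 = rng.standard_normal((256, 10)), np.zeros(10)

    def forward(x):
        h = np.maximum(x @ W1 + b1, 0.0)                   # matmul + ReLU
        logits = h @ W2 + b2                               # matmul
        e = np.exp(logits - logits.max(-1, keepdims=True)) # numerically stable softmax
        return e / e.sum(-1, keepdims=True)

    probs = forward(rng.standard_normal((32, 784)))        # batch of 32 "images"
    print(probs.shape)                                     # (32, 10)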
Inference is mostly just matrix multiplications, so there's plenty of competitors.
Problem is, inference costs do not dominate training costs. Models have a very limited lifespan; they are constantly retrained or obsoleted by new generations, so training is always going on.
Training is not just matrix multiplications, and given the hundreds of experiments in model architecture, it's not even obvious what operations will dominate future training. So a more general-purpose GPU is just a way safer bet.
Also, LLM talent is in extremely short supply, and you don't want to piss them off by telling them they have to spend their time debugging some crappy FPGA because you wanted to save some hardware bucks.
Inference-only hardware like this could be a temporary cost-saving solution for scaling up AI infrastructure, but I think a chip this size should be able to support training; otherwise it's just a waste of money and energy. I predict inference will move to edge computing based on mobile chips like the ones from Qualcomm in the medium term.
For inference, models should be compiled down directly to WASM or WebGPU or whatever, right? The driving language really shouldn't matter at the end of the day.
Unless you've got massive compute bound transformers or old-school full convolutions, if you're interpreting a list of operations you're going to lose perf.
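Something like the following is what I have in mind: export the frozen graph once and let an ahead-of-time compiler handle it, rather than dispatching one op at a time. ONNX is used here purely as one example of a portable graph format; the file names are made up.

    # Sketch of the "compile the graph once" workflow: trace/export a frozen
    # model to a portable graph format, then hand it to whatever backend
    # compiler targets your device.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
    dummy_input = torch.randn(1, 784)

    torch.onnx.export(model, dummy_input, "classifier.onnx")
    # From here a runtime/compiler (ONNX Runtime, a WASM/WebGPU backend, a
    # vendor toolchain, ...) can fuse and code-generate ahead of time instead
    # of interpreting a list of ops at runtime.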
Chips optimized to perform the kinds of calculations used for NN inference at high parallelism. A good example would be the Google spinoff https://coral.ai/ (though their use case is highly limited by sub-par software support).
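For what it's worth, the programming model there is straightforward; here's a rough sketch of running a precompiled model through the TFLite runtime with the Edge TPU delegate (the model path is made up, and the model has to be compiled for the Edge TPU beforehand):

    # Run a precompiled TFLite model on a Coral Edge TPU via the delegate.
    import numpy as np
    import tflite_runtime.interpreter as tflite

    interpreter = tflite.Interpreter(
        model_path="mobilenet_v2_edgetpu.tflite",
        experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
    )
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    print(interpreter.get_tensor(out["index"]).shape)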
Is software that important on the inference side, assuming all the key ops are supported by the compiler? Once the model is quantized and frozen, deploying to alternative chips, while somewhat cumbersome, hasn't been too challenging, at least in my experience with Qualcomm NPU deployment (trained on NVIDIA).
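A sketch of the quantize-and-freeze step I mean, using PyTorch post-training dynamic quantization as one example (the vendor-specific conversion to the NPU format happens after this and isn't shown):

    # Post-training dynamic quantization: weights stored as int8, graph frozen.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    print(quantized)
    # This frozen, quantized model is what then goes through the chip vendor's
    # converter/compiler for deployment.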
Future?
Most modern CPUs do this to some extent. Remember you can only have a very simple model here, because the inference time has to be extremely fast.
Compact, low-power chips to run ML inference on stuff like vision, image generation, high-quality voice synthesis, maybe even translation and LLM stuff, would be very welcome.
It is interesting that one can write code (for which certain computational/logic structures are automatically inferred) describing hardware to run inference models, while inferring that such a piece of computing power will be useful to infer the future.
I'm an ML engineer, but I know nothing about the inference part. Are there that many kinds of devices that optimizing inference for a device is a thing? I thought almost everyone serves from GPUs/TPUs, and hence there are only two major device types. What am I missing here?