This is where inference speed starts to matter. An H100 might be cheaper per inference than Groq, but cutting the wait time from 1 minute to 10 seconds could be a big deal.
Some kinds of inference are expensive, yes, not going to dispute that. But 99.95% of it is actually surprisingly inexpensive. Hell, a lot of useful workloads can be deployed on a cell phone nowadays, and that fraction will increase over time, further reducing inference costs or eliminating them outright (or rather moving them to the consumer).
For the vast majority of people the main expense is creating the combination of a dataset and model that works for their practical problem, with the dataset being the harder (and sometimes more expensive) problem of the two.
The dataset is also their "moat", even though most of them don't realize it, and don't put enough care into that part of the pipeline.
Are you referring to the (one-time) training cost or the cost per inferred token? The latter is pretty acceptable these days, especially with smaller models.
Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.
They're saying that with this architecture there's a tradeoff between training and inference cost: a 10x smaller model (much cheaper to run inference on) can match a bigger model if it's trained on 100x the data (much more expensive to train), and the improvement continues log-linearly.
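A rough back-of-envelope sketch of that tradeoff, using the common approximations that training costs about 6 * params * tokens FLOPs and inference about 2 * params FLOPs per token; the model sizes and token counts below are purely illustrative, not from the paper:

    # Illustrative only: compare a "big" model to one 10x smaller trained on 100x data.
    def train_flops(params, tokens):
        return 6 * params * tokens          # standard training-cost approximation

    def infer_flops_per_token(params):
        return 2 * params                   # standard inference-cost approximation

    big_params, big_tokens = 70e9, 1.4e12       # hypothetical "big" model
    small_params, small_tokens = 7e9, 140e12    # 10x smaller, 100x more data

    train_ratio = train_flops(small_params, small_tokens) / train_flops(big_params, big_tokens)
    infer_ratio = infer_flops_per_token(small_params) / infer_flops_per_token(big_params)

    print(f"training cost ratio (small/big): {train_ratio:.0f}x")   # ~10x more expensive
    print(f"inference cost ratio (small/big): {infer_ratio:.1f}x")  # ~0.1x, i.e. 10x cheaper

So the extra one-time training spend pays for itself once enough tokens are served at the cheaper per-token rate.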
Inference costs are non-trivial, and I wouldn’t be surprised if the cost of running ChatGPT (given the 3M/day figure) has surpassed the cost of training it. Without optimizations, training only uses about 3 times as much memory as inference, so exponential parameter/cost scaling still affects both.
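The ~3x figure presumably comes from training having to hold gradients and an optimizer state alongside the weights, while inference only needs the weights; activations and Adam's second moment would push the multiple higher. A minimal sketch under those assumptions, with a made-up 7B-parameter model:

    # Illustrative memory estimate: weights only (inference) vs.
    # weights + gradients + one optimizer state, e.g. SGD momentum (training).
    BYTES_PER_PARAM = 4  # fp32

    def inference_memory_gb(params):
        return params * BYTES_PER_PARAM / 1e9

    def training_memory_gb(params):
        weights = params * BYTES_PER_PARAM
        grads = params * BYTES_PER_PARAM
        momentum = params * BYTES_PER_PARAM
        return (weights + grads + momentum) / 1e9

    params = 7e9  # hypothetical 7B-parameter model
    print(f"inference: ~{inference_memory_gb(params):.0f} GB")                   # ~28 GB
    print(f"training:  ~{training_memory_gb(params):.0f} GB "
          f"({training_memory_gb(params) / inference_memory_gb(params):.0f}x)")  # ~84 GB, 3x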
There’s ongoing research to reduce the computational costs of inference, but to my knowledge they only offer linear improvements (although I wouldn’t bet against more substantial reductions in the near future, particularly as these techniques are compounded).
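As one concrete example of what a constant-factor ("linear") improvement looks like in practice, here's post-training dynamic int8 quantization in PyTorch on a toy model (the model is just a placeholder); it roughly quarters weight memory and gives a constant-factor CPU speedup, but doesn't change how cost scales with model size:

    import torch
    import torch.nn as nn

    # Toy model, purely illustrative.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

    # Quantize the Linear layers' weights to int8 for inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    print(model(x).shape, quantized(x).shape)  # same interface, smaller/faster linears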