
What's the cost per inference relative to H100? Isn't that the number to care about?



This is where inference speed starts to matter. H100 might be cheaper per inference than Groq, but cutting the wait time from 1 minute to 10 seconds could be a big deal.
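
A rough back-of-envelope for the cost side (every number below is an assumption for illustration, not a measured figure): take an hourly H100 rental price and an assumed batched throughput, and the cost per million tokens falls out directly.

    # Back-of-envelope H100 serving cost -- all numbers are assumptions, not benchmarks.
    gpu_cost_per_hour = 2.00      # assumed on-demand $/hr for a single H100
    tokens_per_second = 1500      # assumed aggregate throughput with batching (7B-class model)

    tokens_per_hour = tokens_per_second * 3600
    cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"~${cost_per_million_tokens:.2f} per 1M tokens")  # ~$0.37 with these assumptions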

Inference costs more.

Yes, only interested in the inference cost.

I'd bet $50 the inference is more expensive.

Is inference really that expensive? Anyway, if the price is too low they could easily charge per query.

Ah, that's fair, and faster than any of the LMDeploy stats for batch size 1; nice work!

Using an H100 for inference, especially without batching, sounds awfully expensive. Is cost much of a concern for you right now?
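
To make the batching point concrete, here's a toy calculation (assumed numbers, and it assumes throughput scales roughly linearly with batch size while per-request latency stays flat, which only holds up to a point):

    # Toy illustration of why unbatched H100 inference looks expensive.
    # Every figure here is an assumption for the sake of the example.
    gpu_cost_per_hour = 2.00        # assumed $/hr
    seconds_per_request = 2.0       # assumed generation time for one request

    for batch_size in (1, 8, 32):
        # assumes requests in a batch complete in roughly the same wall-clock time
        requests_per_hour = batch_size * 3600 / seconds_per_request
        cost_per_request = gpu_cost_per_hour / requests_per_hour
        print(f"batch={batch_size:>2}: ~${cost_per_request:.5f} per request")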


Some kinds of inference are expensive, yes, not going to dispute that. But 99.95% of it is actually surprisingly inexpensive. Hell, a lot of useful workloads can be deployed on a cell phone nowadays, and that fraction will increase over time, further reducing inference costs or eliminating them outright (or rather moving them to the consumer).

For the vast majority of people the main expense is creating the combination of a dataset and model that works for their practical problem, with the dataset being the harder (and sometimes more expensive) problem of the two.

The dataset is also their "moat", even though most of them don't realize it, and don't put enough care into that part of the pipeline.


Thanks! Yes, inference costs are non-negligible right now, but we think they will come down over time.

Is it possible that the inference cost is so high it isn't viable?

Shouldn't inference be 99.999% of the compute cost over time? Especially for MS. Look at how many Copilots they're cramming into their products.

Are you referring to the (one-time) training cost or the cost per inferred token? The latter is pretty acceptable these days, especially with smaller models.

Agreed that there are workloads where inference is not expensive, but it's really workload dependent. For applications that run inference over large amounts of data in the computer vision space, inference ends up being a dominant portion of the spend.

haha, a costly demo to run with the HN hug + inference costs...

Also missed in the post is that fp8 is much more efficient.

The H100s are actually very good for inference.


Isn’t this already the case? It’s probably still cheaper considering the inference costs.

Scaled inference isn't cheap either :/

Inference compute costs and training compute costs aren’t the same. Training costs are an order of magnitude higher.

They're saying that with this architecture there's a tradeoff between training and inference cost: a 10x smaller model (much cheaper to run inference on) can match a bigger model if the smaller one is trained on 100x the data (much more expensive to train), and the improvement continues log-linearly.
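
A sketch of that tradeoff using a Chinchilla-style loss curve; the constants below are made up for illustration rather than taken from the post, but they show the claimed shape: a 10x smaller model trained on 100x the data can land at roughly the same loss, with log-linear returns to data.

    # Chinchilla-style loss form L(N, D) = E + A/N**alpha + B/D**beta.
    # All constants are illustrative assumptions, not fitted values.
    E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    big   = loss(70e9, 1.4e12)    # big model on a "typical" token budget
    small = loss(7e9, 140e12)     # 10x smaller model on 100x the data
    print(f"big: {big:.3f}  small: {small:.3f}")  # with these constants the small model roughly matches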

Inference costs are non-trivial, and I wouldn't be surprised if the cost of running ChatGPT (given the 3M/day figure) has surpassed the cost of training it. Without optimizations, training only uses ~3 times the memory of inference (rough accounting sketched below), so exponential parameter/cost scaling still affects both.

There's ongoing research into reducing the computational cost of inference, but to my knowledge the resulting techniques only offer linear improvements (although I wouldn't bet against more substantial reductions in the near future, particularly as they compound).
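
The ~3x memory figure above can be sanity-checked with simple per-parameter accounting. The sketch below assumes fp32 weights and SGD with momentum, and it ignores activations and the KV cache; with Adam states and mixed-precision master weights the multiple ends up higher.

    # Per-parameter memory accounting -- a simplification that ignores activations/KV cache.
    n_params = 7e9                # assumed model size
    bytes_fp32 = 4

    inference_bytes = n_params * bytes_fp32          # weights only
    training_bytes  = n_params * bytes_fp32 * 3      # weights + gradients + momentum buffer
    print(f"inference: {inference_bytes/1e9:.0f} GB, "
          f"training: {training_bytes/1e9:.0f} GB ({training_bytes/inference_bytes:.0f}x)")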

