
A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference won't be a GPU-exclusive activity going forward. Given that RAM is the main bottleneck at this point, running on the CPU seems more practical for most end users.

See: https://news.ycombinator.com/item?id=35602234
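As a rough illustration of why RAM is the limiting factor, here is a back-of-envelope sketch; the parameter counts and quantization widths below are just assumed examples, not tied to any particular model:

    # Back-of-envelope RAM needed just to hold model weights.
    # Parameter counts and quantization widths are illustrative assumptions.
    PARAM_COUNTS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    for name, n_params in PARAM_COUNTS.items():
        for quant, bpp in BYTES_PER_PARAM.items():
            gib = n_params * bpp / 2**30
            print(f"{name} @ {quant}: ~{gib:.1f} GiB of weights")

Actual usage is somewhat higher once you add the KV cache and runtime overhead, but the weights dominate.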




Are you running inference on CPU or GPU?

Yep, but running inference on it with any reasonable performance requires you to have all of it in GPU RAM, i.e. you need a cluster of ~100 high-performance GPUs.
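For a sense of the scale involved, a hedged back-of-envelope sketch; the parameter count, precision, and per-card VRAM are placeholder assumptions, and real deployments also need memory for the KV cache and activations, plus replicas for throughput:

    import math

    # Placeholder numbers: a 175B-parameter model in fp16 on 24 GiB cards.
    n_params = 175e9
    bytes_per_param = 2          # fp16
    vram_per_gpu_gib = 24        # e.g. a consumer-class card

    weights_gib = n_params * bytes_per_param / 2**30
    gpus_for_weights = math.ceil(weights_gib / vram_per_gpu_gib)
    print(f"~{weights_gib:.0f} GiB of weights -> at least {gpus_for_weights} "
          "GPUs before KV cache, activations, and framework overhead")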

Are people largely running inference on GPU/TPU? At my job, we run inference with pretty large transformers on CPU just because of how much cheaper it is.

Just from the abstract, this is primarily for batched inference. For batched inference, using GPUs gives an order-of-magnitude speed increase, so it's probably not something that usually makes sense to do on CPUs…
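As a toy illustration of where that batching speedup comes from (not the paper's method; the layer size and batch size are arbitrary), amortizing per-call overhead and weight reads over many inputs is what gives batched inference its throughput edge:

    import time
    import torch

    # Toy comparison: 256 requests one at a time vs. in a single batch.
    torch.manual_seed(0)
    layer = torch.nn.Linear(4096, 4096)
    inputs = torch.randn(256, 4096)

    with torch.no_grad():
        t0 = time.perf_counter()
        for x in inputs:                  # one request at a time
            layer(x.unsqueeze(0))
        t_single = time.perf_counter() - t0

        t0 = time.perf_counter()
        layer(inputs)                     # all 256 requests at once
        t_batched = time.perf_counter() - t0

    print(f"unbatched: {t_single*1e3:.1f} ms, batched: {t_batched*1e3:.1f} ms")

On a GPU the gap is far wider than on a CPU, since the weights are read from memory once per batch rather than once per request.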

This is great! CPU only for now, or can it leverage the GPU for faster inference?

I thought they didn't use the GPU for inference?

The link only mentions requirements to run inference at "decent speeds", without going into details about what they consider to be decent speeds.

In principle you can of course run any model on any hardware that has enough RAM. Whether the inference performance is acceptable depends on your particular application.

I'd argue that for most non-interactive use cases, inference speed doesn't really matter and the cost benefit from running on CPUs vs GPUs might be worth it.
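One way to make that cost comparison concrete is to put it in per-token terms; the hourly prices and throughputs below are made-up placeholders, and the interesting part is the formula rather than which side wins:

    # Hypothetical hourly prices and throughputs; plug in your own numbers.
    scenarios = {
        "CPU instance": {"usd_per_hour": 0.20, "tokens_per_sec": 8},
        "GPU instance": {"usd_per_hour": 2.00, "tokens_per_sec": 400},
    }

    for name, s in scenarios.items():
        tokens_per_hour = s["tokens_per_sec"] * 3600
        usd_per_million = s["usd_per_hour"] / tokens_per_hour * 1e6
        print(f"{name}: ~${usd_per_million:.2f} per million tokens")

Whichever side comes out cheaper depends entirely on the prices and throughput you actually see, plus whether the GPU would otherwise sit idle.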


Inference won't work well on CPU; I frequently see 100 percent CPU utilization when doing CPU inference, and we had to switch to GPUs for a large client.

It's true that inference is still very often done on CPU, or even on microcontrollers. In our view, this is in large part because many applications lack good options for inference accelerator hardware. This is what we aim to change!

Inference can still be a bottleneck, I think, since you usually load the whole model into memory, which is usually 32-64GB+?

How realistic is CPU-only inference in the near future?

For decent on-device inference, you need enough memory of high-enough bandwidth and enough processing power connected to that memory.

The traditional PC architecture is ill-suited to this, which is why for most computers, a GPU (which offers all three in a single package) is currently the best approach... only at the moment, even a 4090 offers only enough memory for moderately-sized models to load and run.

The architecture that supports Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the model that can be loaded, some higher-end options have decently-high memory bandwidth, and the architecture lets any of the different processing units access the unified memory. Their weakness is cost, and that the processors aren't powerful enough yet to compete at the top end.
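A rough rule of thumb behind the bandwidth point: for single-stream generation, each token requires streaming roughly all of the weights through the processor, so memory bandwidth divided by model size gives an upper bound on tokens per second. The model size and bandwidth figures below are illustrative assumptions:

    # Upper bound on single-stream decode speed when memory-bandwidth-bound.
    model_size_gib = 35              # e.g. a ~70B model quantized to ~4-bit
    assumed_bandwidth_gb_s = {       # illustrative peak memory bandwidths
        "dual-channel DDR5 desktop": 80,
        "higher-end unified-memory machine": 400,
        "high-end discrete GPU": 1000,
    }

    model_bytes = model_size_gib * 2**30
    for name, gb_s in assumed_bandwidth_gb_s.items():
        tokens_per_sec = gb_s * 1e9 / model_bytes
        print(f"{name}: ~{tokens_per_sec:.1f} tokens/s upper bound")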

So there's currently an empty middle ground to be won, and it's interesting to watch how it will be won, e.g.:

- affordable GPUs with much larger memory for the big models? (i.e. GPUs, but optimised)

- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)

- something else (probably on the software side) making larger model inference more efficient, either from a processing side (i.e. faster for the same output) or from a memory utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, or smaller models for the same output)
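On the last point, a minimal sketch of the streaming idea: keep the weights on disk, memory-map them, and touch one layer at a time so only the working set needs to be resident. The shapes, file name, and tanh "layer" are all made up for illustration:

    import numpy as np

    N_LAYERS, DIM = 4, 1024

    # One-time setup: write fake per-layer weight matrices to a single file.
    weights = np.random.rand(N_LAYERS, DIM, DIM).astype(np.float32)
    weights.tofile("weights.bin")

    # Inference time: mmap the file; the OS pages in only the layers we touch.
    mmapped = np.memmap("weights.bin", dtype=np.float32,
                        mode="r", shape=(N_LAYERS, DIM, DIM))

    x = np.random.rand(DIM).astype(np.float32)
    for layer in range(N_LAYERS):
        w = mmapped[layer]           # only this slice needs to be resident
        x = np.tanh(w @ x)           # stand-in for a real layer's computation
    print("output norm:", float(np.linalg.norm(x)))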


Not sure why this post is getting flagged to oblivion, but it is technically pretty interesting.

GPUs remain expensive, and their use is typically prioritized for training, with inference then run on CPUs. This provides a cost-effective way to attach GPU resources on demand to regular instances, rather than having to run dedicated GPU instances.


Inference on GPU is already very slow on the full-scale non-distilled model (in the 1-2 sec range, iirc); on CPU it would be an order of magnitude slower.

Cannot one do inference using a CPU?

Well... I feel like people are only focusing on training. CPUs for inference are sometimes helpful in online inference settings where you don't ever really "batch" requests. In this case, they can be cheap.

For applications that aren't latency sensitive, I run inference on a free 4 core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.
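For reference, a minimal sketch of what that kind of CPU-only setup can look like, assuming the llama-cpp-python bindings and a locally downloaded GGUF file; the model path is a placeholder:

    from llama_cpp import Llama

    # CPU-only inference sketch; n_threads matches the 4 cores mentioned above.
    llm = Llama(model_path="./models/placeholder-7b-q4.gguf",
                n_ctx=2048,
                n_threads=4)

    out = llm("Summarize why CPU inference can be good enough for batch jobs:",
              max_tokens=128)
    print(out["choices"][0]["text"])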

The model itself is hardware-agnostic, so there's nothing preventing someone from building a frontend for their platform of choice.

Granted, powerful hardware is still required to run inference at acceptable speeds (or at all - I don't know the memory requirements).


Wow, this is really cool! Is inference running on GPU or CPU? I wonder how well this would run if model storage and inference are done on a Jetson Nano!
