
A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference won't be a GPU-exclusive activity going forward. Given that RAM is the main bottleneck at this point, running on the CPU seems more practical for most end users.

See: https://news.ycombinator.com/item?id=35602234
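As a rough illustration of why RAM is the limiting factor, here is a back-of-envelope sketch; the parameter counts and quantization widths below are just assumed examples, not tied to any particular model:

    # Back-of-envelope RAM needed just to hold model weights.
    # Parameter counts and quantization widths are illustrative assumptions.
    PARAM_COUNTS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    for name, n_params in PARAM_COUNTS.items():
        for quant, bpp in BYTES_PER_PARAM.items():
            gib = n_params * bpp / 2**30
            print(f"{name} @ {quant}: ~{gib:.1f} GiB of weights")

Actual usage is somewhat higher once you add the KV cache and runtime overhead, but the weights dominate.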




Are you running inference on CPU or GPU?

Yep, but running inference on it with any reasonable performance requires you to have all of it in GPU RAM, i.e. you need a cluster of ~100 high-performance GPUs.
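For a sense of the scale involved, a hedged back-of-envelope sketch; the parameter count, precision, and per-card VRAM are placeholder assumptions, and real deployments also need memory for the KV cache and activations, plus replicas for throughput:

    import math

    # Placeholder numbers: a 175B-parameter model in fp16 on 24 GiB cards.
    n_params = 175e9
    bytes_per_param = 2          # fp16
    vram_per_gpu_gib = 24        # e.g. a consumer-class card

    weights_gib = n_params * bytes_per_param / 2**30
    gpus_for_weights = math.ceil(weights_gib / vram_per_gpu_gib)
    print(f"~{weights_gib:.0f} GiB of weights -> at least {gpus_for_weights} "
          "GPUs before KV cache, activations, and framework overhead")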

Are people largely running inference on GPU/TPU? At my job, we run inference with pretty large transformers on CPU just because of how much cheaper it is.

Just from the abstract, this is primarily for batched inference. For batched inference, using GPUs gives an order-of-magnitude speed increase, so it's probably not something that usually makes sense to do on CPUs…
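As a toy illustration of where that batching speedup comes from (not the paper's method; the layer size and batch size are arbitrary), amortizing per-call overhead and weight reads over many inputs is what gives batched inference its throughput edge:

    import time
    import torch

    # Toy comparison: 256 requests one at a time vs. in a single batch.
    torch.manual_seed(0)
    layer = torch.nn.Linear(4096, 4096)
    inputs = torch.randn(256, 4096)

    with torch.no_grad():
        t0 = time.perf_counter()
        for x in inputs:                  # one request at a time
            layer(x.unsqueeze(0))
        t_single = time.perf_counter() - t0

        t0 = time.perf_counter()
        layer(inputs)                     # all 256 requests at once
        t_batched = time.perf_counter() - t0

    print(f"unbatched: {t_single*1e3:.1f} ms, batched: {t_batched*1e3:.1f} ms")

On a GPU the gap is far wider than on a CPU, since the weights are read from memory once per batch rather than once per request.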

This is great! CPU only for now, or can it leverage the GPU for faster inference?

I thought they didn't use the GPU for inference?

The link only mentions requirements to run inference at "decent speeds", without going into details about what they consider to be decent speeds.

In principle you can of course run any model on any hardware that has enough RAM. Whether the inference performance is acceptable depends on your particular application.

I'd argue that for most non-interactive use cases, inference speed doesn't really matter and the cost benefit from running on CPUs vs GPUs might be worth it.
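One way to make that cost comparison concrete is to put it in per-token terms; the hourly prices and throughputs below are made-up placeholders, and the interesting part is the formula rather than which side wins:

    # Hypothetical hourly prices and throughputs; plug in your own numbers.
    scenarios = {
        "CPU instance": {"usd_per_hour": 0.20, "tokens_per_sec": 8},
        "GPU instance": {"usd_per_hour": 2.00, "tokens_per_sec": 400},
    }

    for name, s in scenarios.items():
        tokens_per_hour = s["tokens_per_sec"] * 3600
        usd_per_million = s["usd_per_hour"] / tokens_per_hour * 1e6
        print(f"{name}: ~${usd_per_million:.2f} per million tokens")

Whichever side comes out cheaper depends entirely on the prices and throughput you actually see, plus whether the GPU would otherwise sit idle.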


Inference won't work well on CPU; I frequently see 100 percent CPU utilization when doing CPU inference, and we had to switch to GPUs for a large client.

It's true that inference is still very often done on CPU, or even on microcontrollers. In our view, this is in large part because many applications lack good options for inference accelerator hardware. This is what we aim to change!

Inference can still be a bottleneck, I think, since you usually load the whole model into memory, which is usually 32-64GB+?

How realistic is CPU-only inference in the near future?

For decent on-device inference, you need enough memory of high-enough bandwidth and enough processing power connected to that memory.

The traditional PC architecture is ill-suited to this, which is why for most computers, a GPU (which offers all three in a single package) is currently the best approach... only at the moment, even a 4090 offers only enough memory for moderately-sized models to load and run.

The architecture that supports Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the model that can be loaded, some higher-end options have decently-high memory bandwidth, and the architecture lets any of the different processing units access the unified memory. Their weakness is cost, and that the processors aren't powerful enough yet to compete at the top end.
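A rough rule of thumb behind the bandwidth point: for single-stream generation, each token requires streaming roughly all of the weights through the processor, so memory bandwidth divided by model size gives an upper bound on tokens per second. The model size and bandwidth figures below are illustrative assumptions:

    # Upper bound on single-stream decode speed when memory-bandwidth-bound.
    model_size_gib = 35              # e.g. a ~70B model quantized to ~4-bit
    assumed_bandwidth_gb_s = {       # illustrative peak memory bandwidths
        "dual-channel DDR5 desktop": 80,
        "higher-end unified-memory machine": 400,
        "high-end discrete GPU": 1000,
    }

    model_bytes = model_size_gib * 2**30
    for name, gb_s in assumed_bandwidth_gb_s.items():
        tokens_per_sec = gb_s * 1e9 / model_bytes
        print(f"{name}: ~{tokens_per_sec:.1f} tokens/s upper bound")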

So there's currently an empty middle ground to be won, and it's interesting to watch how it will be won, e.g.:

- affordable GPUs with much larger memory for the big models? (i.e. GPUs, but optimised)

- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)

- something else (probably on the software side) making larger model inference more efficient, either from a processing side (i.e. faster for the same output) or from a memory utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, or smaller models for the same output)
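On the last point, a minimal sketch of the streaming idea: keep the weights on disk, memory-map them, and touch one layer at a time so only the working set needs to be resident. The shapes, file name, and tanh "layer" are all made up for illustration:

    import numpy as np

    N_LAYERS, DIM = 4, 1024

    # One-time setup: write fake per-layer weight matrices to a single file.
    weights = np.random.rand(N_LAYERS, DIM, DIM).astype(np.float32)
    weights.tofile("weights.bin")

    # Inference time: mmap the file; the OS pages in only the layers we touch.
    mmapped = np.memmap("weights.bin", dtype=np.float32,
                        mode="r", shape=(N_LAYERS, DIM, DIM))

    x = np.random.rand(DIM).astype(np.float32)
    for layer in range(N_LAYERS):
        w = mmapped[layer]           # only this slice needs to be resident
        x = np.tanh(w @ x)           # stand-in for a real layer's computation
    print("output norm:", float(np.linalg.norm(x)))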


Not sure why this post is getting flagged to oblivion, but it is technically pretty interesting.

GPUs remain expensive, and their use is typically prioritized for training, with inference then run on CPUs. This provides a cost-effective way to attach GPU resources on demand to regular instances, rather than having to run dedicated GPU instances.


Inference on GPU is already very slow on the full-scale non-distilled model (in the 1-2 sec range, iirc); on CPU it would be an order of magnitude slower.

Cannot one do inference using a CPU?

Well... I feel like people are only focusing on training. CPUs for inference are sometimes helpful in online inference settings where you don't ever really "batch" requests. In this case, they can be cheap.

For applications that aren't latency sensitive, I run inference on a free 4 core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.
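For reference, a minimal sketch of what that kind of CPU-only setup can look like, assuming the llama-cpp-python bindings and a locally downloaded GGUF file; the model path is a placeholder:

    from llama_cpp import Llama

    # CPU-only inference sketch; n_threads matches the 4 cores mentioned above.
    llm = Llama(model_path="./models/placeholder-7b-q4.gguf",
                n_ctx=2048,
                n_threads=4)

    out = llm("Summarize why CPU inference can be good enough for batch jobs:",
              max_tokens=128)
    print(out["choices"][0]["text"])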

The model itself is hardware-agnostic, so there's nothing preventing someone from building a frontend for their platform of choice.

Granted, powerful hardware is still required to run inference at acceptable speeds (or at all - I don't know the memory requirements).


Wow, this is really cool! Is inference running on GPU or CPU? I wonder how well this would run if model storage and inference are done on a Jetson Nano!
