GPU Guide (For AI Use-Cases) (gpus.llm-utils.org)
49 points by tikkun | 2023-07-07 09:37:58 | 27 comments




Solid work. I'm holding off for about two years, since that much development time will give Nvidia a chance to make a consumer-friendly card with giant VRAM. (I'm assuming the 4090 still counts as consumer friendly.)

As long as ChatGPT still lets people ask medical/healthcare questions, it doesn't seem too urgent. I'm dreading the day the AMA lobbies to ban that unless it's under physician supervision.


Nvidia has no motivation to make a consumer card with lots of VRAM; that's basically the only (relevant) separator between the GeForce family and the Quadro lineup.

There are restrictions on NVENC streams with consumer cards, but that has been a solved problem for a while [0].

If they were to make a consumer card with more VRAM, it would immediately undercut their own Quadro/Tesla lineup, which costs substantially more. I don't see a reason for them to do it.

0: https://github.com/keylase/nvidia-patch


>Nvidia has no motivation to make a consumer card with lots of VRAM

Money


Presumably they're making a higher margin on the professional cards.

If NVidia doesn't, somebody else will. Maybe AMD, maybe Apple, maybe Intel.

Apple's neural engine is really interesting as a GPU alternative. It's still just a bunch of tensor cores, but the unified memory architecture allows it to use the full system RAM without a major performance hit. I think we will see more of this in the future.


This seems ridiculously excessive.

Stable Diffusion runs well on small GPUs... my 6GB laptop 2060 can generate at reasonably big resolutions, run Facebook's AITemplate inference, or train a LoRA. A 12GB-24GB GPU can run pretty much anything with no issues or fuss.
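
For concreteness, here's a minimal sketch of that low-VRAM Stable Diffusion setup using the Hugging Face diffusers library (the model id and options are illustrative, not necessarily what the commenter runs); fp16 weights plus attention slicing are what keep it within ~6GB:

  # Minimal sketch: Stable Diffusion on a ~6GB card with Hugging Face diffusers.
  # fp16 weights plus attention slicing keep peak VRAM low; the model id is illustrative.
  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5",  # any SD 1.x checkpoint works here
      torch_dtype=torch.float16,
  ).to("cuda")
  pipe.enable_attention_slicing()  # trades a little speed for much lower VRAM use

  image = pipe("a watercolor fox in the snow", num_inference_steps=30).images[0]
  image.save("fox.png")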

A single 3090 (or 7900 XTX) can run LLaMA 65B reasonably quickly with llama.cpp. 2x 3090s can run it very quickly with ExLlama.

MPT and Falcon quantization is extremely new (and TBH the finetunes are not as good as LLaMA's yet), but MPT-30B will fit on a single 3090 (or a smaller GPU with llama.cpp), and Falcon (I think) can be split with GPTQ.


Thanks.

Based on your comment, I added a note that the GPU recommendations are for GPU clouds (where, if you're running Stable Diffusion, ~$0.50/hr for a 4090 is pretty practical, and it's not worth going to a cheaper card for most people), and I added this note re: local GPUs:

  If you're using a local GPU:
  
  * Same as above, but you probably won't be able to train or fine tune an LLM!
  * Most of the open LLMs have versions available that can run on lower VRAM cards e.g. [Falcon on a 40GB card](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
  * Thanks Bruce for prompting me to add this section
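
As one concrete illustration of the lower-VRAM route (this sketch uses 4-bit loading via transformers + bitsandbytes rather than the GPTQ checkpoint linked above; the model id and options are placeholders):

  # Sketch: loading an open LLM in 4-bit so it fits on a smaller card. This uses
  # transformers + bitsandbytes quantization (an alternative route to the GPTQ
  # checkpoint linked above); the model id and options are placeholders.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "tiiuae/falcon-7b-instruct"  # swap in a bigger model if VRAM allows

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      device_map="auto",       # spread layers across available GPU(s) and CPU RAM
      load_in_4bit=True,       # bitsandbytes 4-bit quantization, roughly 4x less VRAM
      trust_remote_code=True,  # Falcon shipped custom modeling code at the time
  )

  inputs = tokenizer("What GPU do I need to run you?", return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(out[0], skip_special_tokens=True))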

The UIs/containers people use for 4090s specifically have some kind of performance issue with Stable Diffusion, last I checked. It runs no faster than a 3090.

I think this can be solved by running RTX 4000-series cards with CUDA 12.1 and PyTorch nightly (cu121)... but that's a very uncommon setup.
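
A quick way to sanity-check which CUDA build of PyTorch you actually have (the output values below are illustrative and will differ per setup):

  # Quick sanity check of which CUDA build of PyTorch is installed.
  import torch

  print(torch.__version__)              # e.g. a 2.x nightly build string
  print(torch.version.cuda)             # should say "12.1" for a cu121 wheel
  print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"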

Also, if we are talking cloud deployments, cheap CPU instances with 32GB-64GB of RAM are a somewhat sane way to test LLMs, thanks to llama.cpp. A GPU will give you better throughput, of course.

But CPU support for Falcon and MPT specifically is in flux; see this post, for instance: https://huggingface.co/TheBloke/mpt-30B-instruct-GGML


For people reading this who are using a 4090, this comment from March might help: https://www.reddit.com/r/StableDiffusion/comments/y71q5k/com...

Wait, can llama.cpp accelerate inference without full VRAM for the model? Or do you mean Llama 65B quantized to fit?

Yeah, it can split weights. Whatever fraction of the weights doesn't fit into VRAM will be computed on the CPU (at reasonable speed).

Additionally, prompt processing will still work with large models even on low-VRAM GPUs.
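
For reference, a sketch of what that split looks like with the llama-cpp-python bindings (the model path and layer count are placeholders; the plain llama.cpp CLI exposes the same thing via --n-gpu-layers):

  # Sketch: splitting a quantized (GGML-format) model between GPU and CPU with the
  # llama-cpp-python bindings. n_gpu_layers controls how many layers get offloaded
  # to VRAM; whatever doesn't fit runs on the CPU. Path and numbers are placeholders.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./llama-65b.q4_0.bin",  # placeholder path to a quantized model
      n_gpu_layers=40,  # offload as many layers as your VRAM allows; 0 = pure CPU
      n_ctx=2048,       # context window
  )

  out = llm("Q: Why is VRAM the main bottleneck for local LLMs? A:", max_tokens=128)
  print(out["choices"][0]["text"])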


You're able to train LoRAs with 6GB of VRAM? Do you have any guides or references, by chance?

Note that it's possible to use things like llama.cpp, ggllm.cpp, Kobold, etc. to run these at usable performance on fast CPUs with a good amount of RAM. This is far, far more reasonable if you just want to play around. If you want to seriously use these or experiment with any non-trivial training, that's when you need the kind of power being talked about here.

Apple Silicon is interesting for AI because the GPU shares main memory, so you can get one with a lot of RAM and use either CPU or GPU. Less support for Apple GPUs though, at least so far. Work is being done. M1/M2 Macs with 64GiB of RAM are cheaper than many of these GPUs.
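
A minimal sketch of what that shared-memory setup looks like with PyTorch's MPS backend (tensor sizes are just for illustration):

  # Sketch: running tensor math on the Apple GPU via PyTorch's MPS backend. The GPU
  # draws from the same unified memory pool as the CPU, so capacity is bounded by
  # system RAM rather than a separate VRAM pool.
  import torch

  device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

  x = torch.randn(8192, 8192, device=device)  # allocated in unified memory
  w = torch.randn(8192, 8192, device=device)
  y = x @ w                                   # matmul runs on the Apple GPU
  print(y.shape, y.device)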

Also, wow, does NVIDIA have a near monopoly in this space. Prices will probably come down once Intel and AMD get their acts together.


Yes. For anyone looking for more info on running models locally, https://www.reddit.com/r/LocalLLaMA/wiki/index/ is the best place to start.

Unfortunately, the AMD MI300 will not be cheap, Intel's Max series GPUs (Ponte Vecchio) are more or less limited to HPC land until at least 2025, and Intel's Gaudi line is apparently being discontinued.

There are some worrying reports about Intel Arc gen 2, and AMD is just matching price/VRAM with Nvidia these days.

Some of the AI startups are really exotic/business scale only (like Cerebras) or MIA in the consumer/cheap cloud space (like Tenstorrent).

...There are some rumors of an M2 Pro-like APU from AMD/Intel? I hate to sound so grim, but that's about the only positive news I've got.


Hrm... yeah, I've thought Xeon Max with AVX-512 might be competitive with GPUs, but these things are as expensive as GPUs, or even more so:

https://ark.intel.com/content/www/us/en/ark/products/232592/...

I've seen some interesting work on deploying models to FPGAs, but high-end FPGAs are also expensive.


Strange. There seems to be no mention of vast.ai in the GPU Cloud Providers list, but they're definitely cheaper than all of the options listed.

Perhaps the writer kept them unmentioned in an attempt to reduce demand?


Two users I talked with reported bad experiences with unreliable instances. It's not always bad (they said it can be good, and I know the pricing is often great), but that's why I don't want to recommend it to most people.

I bought an older Nvidia Tesla P40 with 24GB of VRAM, and it does a pretty good job at running SD for a $200 GPU.

How did you get it running? Drivers seem ... tricky ... to get a hold of.

Well, first off, I'm using an AMD GPU to drive my displays, so there's no issue with driver confusion between two types of Nvidia GPUs.

The drivers were pretty easy to find. I think I just googled them.


Mild tangent: I love that we now have a supercomputing architecture named after Grace Hopper.

Are there models that can compete with gpt-3.5-turbo on cost per token at scale? From what I'm hearing, the 30B+ models net out to a higher $/token, but I haven't been able to find anything on 7B and smaller. Thinking about cost specifically here. We're exploring a couple of fine-tunes for specific tasks (we have the data to fine-tune with), but gpt-3.5-turbo does reasonably well on those tasks, so if the cost is an order of magnitude higher, I'm not sure the ROI is there.

Before considering cost (and you might've already done this), I'd try running a 5-shot prompt for your use case with GPT-3.5, MPT-30B, and Falcon-40B. That way you can get a sense of how performance compares without needing to go through fine-tuning. My guess is that 3.5 might still be significantly better on the 5-shot. You're really comparing base 3.5 with fine-tuned MPT-30B/Falcon-40B, though, so for a fairer comparison (until 3.5 fine-tuning is available) you could do something like 2-shot with 3.5 and 10-shot with Falcon and MPT.
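
As a rough sketch of how the GPT-3.5 side of that comparison could be wired up with the openai client that was current at the time (the task and few-shot examples below are placeholders for your own data):

  # Rough sketch: build a k-shot prompt once, so the same examples can be sent to
  # gpt-3.5-turbo and to a local MPT/Falcon model for a side-by-side quality check.
  # Uses the openai 0.x client; the task and examples are placeholders.
  import openai

  FEW_SHOT = [
      ("Classify the sentiment: 'Great battery life.'", "positive"),
      ("Classify the sentiment: 'The screen cracked within a week.'", "negative"),
      # ...add up to 5 task-specific examples here
  ]

  def build_messages(query):
      messages = []
      for question, answer in FEW_SHOT:
          messages.append({"role": "user", "content": question})
          messages.append({"role": "assistant", "content": answer})
      messages.append({"role": "user", "content": query})
      return messages

  resp = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=build_messages("Classify the sentiment: 'Slow shipping, helpful support.'"),
      temperature=0,
  )
  print(resp["choices"][0]["message"]["content"])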

I'd love to see a couple of things added to this: pricing and specific metrics. That would give some rationale behind the recommendations that goes a little deeper than "choose this", e.g. choose this at $X/hr for Y tokens per second, or this one for half the price but Z tokens per second.
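
The conversion itself is simple arithmetic; a tiny helper along these lines would do (the example inputs are purely hypothetical placeholders, not benchmark results):

  # Tiny helper: turn an hourly instance price plus measured throughput into $/1M
  # tokens, which makes the cloud-vs-OpenAI comparison concrete.
  def dollars_per_million_tokens(price_per_hour, tokens_per_second):
      tokens_per_hour = tokens_per_second * 3600
      return price_per_hour / tokens_per_hour * 1_000_000

  # e.g. a hypothetical $1.50/hr GPU instance sustaining 30 tokens/s:
  print(f"${dollars_per_million_tokens(1.50, 30):.2f} per 1M tokens")  # ~$13.89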

It would also be nice to compare some of the local GPU options (including MacBooks) with the cloud options.

Edit: I just noticed https://gpus.llm-utils.org/recommended-gpus-and-gpu-clouds-f...

