Solid work. I'm holding off for about 2 years, since that kind of development time will give Nvidia time to make some consumer-friendly VRAM giant. (I'm assuming the 4090 is still consumer-friendly.)
As long as ChatGPT still lets people ask medical/healthcare questions, it doesn't seem too urgent. I'm dreading when the AMA lobbies to ban it unless under physician supervision.
Nvidia has no motivation to make a consumer card with lots of VRAM; that's basically the only (relevant) separator between the GeForce family and the Quadro lineup.
There are restrictions on NVENC streams with consumer cards, but that has been a solved problem for a while [0].
If they were to make a consumer card with more VRAM, it would immediately undercut their own Quadro/Tesla lineup, which costs substantially more. I don't see a reason for them to do it.
Apple's neural engine is really interesting as a GPU alternative. It's still just a bunch of tensor cores, but the unified memory architecture allows it to use the full system RAM without a major performance hit. I think we will see more of this in the future.
Stable Diffusion runs well on small GPUs... My 6GB laptop 2060 can run reasonably big resolutions. It can run Facebook AITemplate inference or train a LoRA. A 12GB-24GB GPU can run pretty much anything with no issues or fuss.
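For anyone who wants to try that on a ~6GB card, here's a minimal sketch assuming the Hugging Face diffusers library (the model ID and resolution are just examples); fp16 weights plus attention slicing is usually what makes it fit:

```python
# Hedged sketch: Stable Diffusion 1.5 via diffusers on a ~6 GB card.
# fp16 weights + attention slicing keep VRAM use low; exact limits vary by card.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model, swap in whatever you use
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_attention_slicing()          # trades a little speed for a lot of VRAM

image = pipe("a watercolor of a lighthouse at dawn", height=768, width=512).images[0]
image.save("out.png")
```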
A single 3090 (or 7900 XTX) can run LLaMA 65B reasonably quickly with llama.cpp. 2x 3090s can run it very quickly with ExLlama.
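If it helps, a rough sketch of what the llama.cpp route looks like through the llama-cpp-python bindings; the model path and layer count are placeholders, and you'd offload as many layers as actually fit in 24GB:

```python
# Hedged sketch using llama-cpp-python; paths and settings are placeholders, not a recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-65b-q4.bin",  # hypothetical quantized file, whatever format your build expects
    n_gpu_layers=60,                  # offload as many layers as the 24 GB card will hold
    n_ctx=2048,
)

out = llm("Q: Name three uses for a 24 GB GPU.\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```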
MPT and Falcon quantization are extremely new (and TBH the fine-tunes are not as good as LLaMA's yet), but MPT-30B will fit on a single 3090 (or a smaller GPU with llama.cpp), and Falcon (I think) can be split with GPTQ.
Based on your comment, I added a note that the GPU recommendations are for GPU clouds (where, if you're running Stable Diffusion, ~$0.50/hr for a 4090 is pretty practical and it's not worth going to a cheaper card for most), and I added this note re: local GPUs:
If you're using a local GPU:
* Same as above, but you probably won't be able to train or fine-tune an LLM!
* Most of the open LLMs have versions available that can run on lower-VRAM cards, e.g. [Falcon on a 40GB card](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ) (there's a loading sketch after this list)
* Thanks Bruce for prompting me to add this section
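As a rough illustration of what loading one of those GPTQ builds looks like, here's a sketch assuming the AutoGPTQ library; the argument names follow its from_quantized API, but check the model card for the exact settings it expects:

```python
# Hedged sketch: loading a GPTQ-quantized Falcon build; check the model card for exact args.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/falcon-40b-instruct-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,   # Falcon needs its custom modeling code
)

inputs = tokenizer("Write a haiku about VRAM.", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```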
The UIs/containers people use for 4090s specifically have some kind of performance issue with Stable Diffusion, last I checked. It runs no faster than a 3090.
I think this can be solved by running RTX 4000-series cards with CUDA 12.1 and a PyTorch nightly cu121 build... But that's a very uncommon setup.
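If you want to confirm which CUDA build of PyTorch you actually ended up with, a quick check like this (just standard torch introspection) will tell you:

```python
# Quick check of which PyTorch/CUDA combination is actually loaded.
import torch

print(torch.__version__)              # e.g. a nightly cu121 build string
print(torch.version.cuda)             # CUDA toolkit version PyTorch was built against
print(torch.cuda.get_device_name(0))  # should report the RTX 4090
```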
Also, if we are talking cloud deployments, cheap CPU instances with 32GB-64GB of RAM are a somewhat sane way to test LLMs, thanks to llama.cpp. A GPU will give you better throughput, of course.
Note that it's possible to use things like llama.cpp, ggllm.cpp, Kobold, etc. to run these at usable performance on fast CPUs with a good amount of RAM. This is far, far more reasonable if you just want to play around. If you want to seriously use these or experiment with any non-trivial training, that's when you need the kind of power being talked about here.
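For the CPU-only case, the same llama-cpp-python sketch from above works with GPU offload turned off; on a 32GB-64GB instance the knobs that matter are thread count and model size (the path is again a placeholder):

```python
# Hedged sketch: CPU-only inference on a RAM-heavy instance via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-30b-q4.bin",  # hypothetical quantized file that fits in system RAM
    n_gpu_layers=0,                   # pure CPU
    n_threads=16,                     # match the instance's physical cores
    n_ctx=2048,
)

out = llm("Summarize why CPU inference is viable for testing:", max_tokens=64)
print(out["choices"][0]["text"])
```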
Apple Silicon is interesting for AI because the GPU shares main memory, so you can get one with a lot of RAM and use either the CPU or the GPU. There's less software support for Apple GPUs, at least so far, though work is being done. M1/M2 Macs with 64GiB of RAM are cheaper than many of these GPUs.
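On the software-support point, PyTorch's MPS backend already lets you target the Apple GPU directly; a quick sanity check looks something like this (just the standard torch MPS API):

```python
# Minimal check that PyTorch can see the Apple GPU via the MPS backend.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.randn(4096, 4096, device=device)
    y = x @ x  # runs on the GPU, out of the same unified memory pool as the CPU
    print("MPS ok:", y.shape)
else:
    print("MPS backend not available on this machine")
```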
Also, wow, does NVIDIA have a near-monopoly on this space. Prices will probably come down once Intel and AMD get their acts together.
Unfortunately the AMD MI300 will not be cheap, Intel's Xeon Max GPUs are more-or-less limited to HPC land until at least 2025, and Intel's Gaudi line is apparently being discontinued.
There are some worrying reports about Intel Arc gen 2, and AMD is just matching price/VRAM with Nvidia these days.
Some of the AI startups are really exotic/business-scale only (like Cerebras) or MIA in the consumer/cheap cloud space (like Tenstorrent).
...There are some rumors of an M2 Pro-like APU from AMD/Intel? I hate to sound so grim, but that's about the only positive news I've got.
Two users I talked with mentioned bad experiences with them, specifically unreliable instances. It's not always bad, they said it can be good, and I know the pricing is often great, but that's why I don't want to recommend it to most people.
Are there models that can compete with gpt-3.5-turbo on cost per token at scale? From what I'm hearing, the 30B+ models net out to a higher $/token, but I haven't been able to find anything on 7B and lower. Thinking about cost specifically here. We're exploring a couple of fine-tunes for specific tasks (we have the data to fine-tune with), but gpt-3.5-turbo does reasonably well on those tasks, so if the cost is an order of magnitude higher, I'm not sure the ROI is there.
Before considering cost (and you might've already done this), I'd try running a 5-shot prompt for your use case with GPT-3.5, MPT-30B, and Falcon-40B. That way you can get a sense of how performance compares without needing to go through the fine-tuning. My guess is that 3.5 might still be significantly better on the 5-shot. You're really comparing base 3.5 with fine-tuned MPT-30B/Falcon-40B though, so for a fairer comparison (until 3.5 fine-tuning is available), you could do something like 2-shot with 3.5 and 10-shot with Falcon and MPT.
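If it's useful, here's a sketch of the kind of k-shot prompt builder I mean; the task description, examples, and formatting are all placeholders you'd swap for your own data:

```python
# Hedged sketch: build a k-shot prompt from labeled examples; everything here is a placeholder.
def build_k_shot_prompt(examples, query, k):
    """examples: list of (input, output) pairs drawn from your own data."""
    parts = ["Classify the support ticket by urgency."]  # hypothetical task description
    for inp, outp in examples[:k]:
        parts.append(f"Ticket: {inp}\nUrgency: {outp}")
    parts.append(f"Ticket: {query}\nUrgency:")
    return "\n\n".join(parts)

# e.g. 2-shot for gpt-3.5-turbo vs. 10-shot for Falcon/MPT, drawn from the same example pool:
examples = [("Server is down for all users", "high"), ("Typo on the pricing page", "low")]
print(build_k_shot_prompt(examples, "Password reset email never arrives", k=2))
```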
I'd love to see a couple of things added to this: pricing and specific metrics. That would give some rationale behind the recommendations that goes a little deeper than "choose this", e.g. "choose this at $X/h for X tokens per second, or this for half the price but Y tokens per second."
Also, it would be nice to compare some of the local GPU options (including MacBooks) vs. cloud options.