Gamers have a TON of really good, really affordable options. But you kind of need 24GB minimum unless you're using heavy quantization. So 3090s and 4090s are what local LLM people are building with (mostly 3090s, since you can get them for about $700 and they're dang good).
If anything, 24GB is probably the sweet spot that things get optimised for, since many local LLM enthusiasts are running 3090s and 4090s (and all of them want more VRAM).
Are you aware that cards with "LLM-class" (40-80GB) amounts of VRAM cost substantially more, and that the status quo for consumer cards hovers around 4-12GB, only reaching 24GB on top-end cards?
You can finetune Whisper, Stable Diffusion, and LLMs up to about 15B parameters with 24GB of VRAM.
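For a sense of what that looks like in practice, here is a minimal QLoRA-style sketch with Hugging Face transformers + peft: load the base model in 4-bit and train only small LoRA adapters so everything fits in 24GB. The model name and LoRA hyperparameters are illustrative assumptions, not a tested recipe.

```python
# Minimal QLoRA-style setup: 4-bit base weights + LoRA adapters.
# Model name and hyperparameters below are illustrative, not a tested recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-13b-hf"  # ~13B class model, roughly what fits in 24GB this way

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a tiny fraction of weights are trainable
```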
Which leads you to what hardware to get. Best bang for the $ right now is definitely a used 3090 at ~$700. If you want more than 24GB of VRAM, just rent the hardware, as it will be cheaper.
If you're not willing to drop $700, don't buy anything; just rent. I have had decent luck with vast.ai.
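The rent-vs-buy call is simple arithmetic once you plug in a rate. The hourly price below is a placeholder assumption, not a quoted vast.ai figure; substitute whatever you're actually offered.

```python
# Back-of-envelope rent-vs-buy: how many GPU-hours before buying wins?
# The $/hr figure is an assumed placeholder, not a real marketplace quote.
card_price = 700.0    # used 3090, USD
rental_rate = 0.30    # assumed USD per GPU-hour

breakeven_hours = card_price / rental_rate
print(f"Break-even at ~{breakeven_hours:,.0f} GPU-hours "
      f"(~{breakeven_hours / 24:.0f} days of 24/7 use)")
```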
Ok, can someone catch me up to speed on LLM hardware requirements? Last I looked I needed a 20GB VRAM card to run a good one. Is that not true anymore?
Today's announcement is a gaming product, and I'm not sure games need more than 12GB even today. I suspect the 4090's price not budging much is telling as far as how much VRAM gaming demand has actually been focused on.
I assume they will soon have a professional card announcement that includes 48GB+ cards. Assuming the high-VRAM cards see improvements similar to this generational leap on the gaming side, they will be in high demand.
24GB is enough for some serious AI work. 48GB would be better, of course. But high-end GPUs are still used for things other than gaming, from ML/AI to creative work like video editing, animation renders and more.
Going above 24GB is probably not going to be cheap until GDDR7 is out, and even that will only push it to 36GB. The fancier stacked GDDR6 stuff is probably pretty expensive, and you can't just add more dies because of signal-integrity issues.
If you want to run decently heavy models, I'd recommend getting at a minimum 48GB. This allows you to run 34B Llama models with ease, 70B models quantized, and Mixtral without problems.
If you want to run most models, get 64GB. This just gives you some more room to work with.
If you want to run anything, get 128GB or more. Unquantized 70b? Check. Goliath 120b? Check.
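These tiers fall out of simple arithmetic: weight memory is roughly parameter count times bits per weight, plus headroom for KV cache and activations. A rough sketch (the 20% overhead factor is a guess; real usage varies with context length and quantization format):

```python
# Rough VRAM estimate: weights = params * bits/8 bytes, plus ~20% headroom
# for KV cache and activations (a guess; real usage depends on context length).
def est_vram_gib(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 2**30

for label, params, bits in [
    ("34B @ 4-bit", 34, 4),
    ("70B @ 4-bit", 70, 4),
    ("Mixtral 8x7B (~47B) @ 4-bit", 47, 4),
    ("Goliath 120B @ 4-bit", 120, 4),
]:
    print(f"{label}: ~{est_vram_gib(params, bits):.0f} GiB")
```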
Note that high-end consumer GPUs top out at 24GB of VRAM. I have one 7900 XTX for running LLMs, and the best it can reliably run is 4-bit quantized 34B models; anything larger ends up partially in regular RAM.
If you want to get the most bang for your buck, you definitely need to run quantized versions. Yes, there are models that run in 11GB, just like there are models that run in 8GB, or in any other amount of VRAM; my point is that 24GB is the sweet spot.
An RTX 3090 has 24GB of memory, while a quantized Llama 70B takes around 60GB. You can offload a few layers to the GPU, but most of them will run on the CPU at terrible speeds.
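That offloading split is usually a single knob. For example, with llama-cpp-python you pick how many layers go to the GPU; the GGUF path and layer count below are placeholders, and with a ~60GB model and 24GB of VRAM only a fraction of the layers will fit.

```python
# Partial GPU offload with llama-cpp-python: put as many layers as fit in
# 24GB on the GPU, run the rest on the CPU (slowly). Path and layer count
# are placeholders for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q5_K_M.gguf",  # hypothetical quantized 70B file
    n_gpu_layers=30,   # out of ~80 layers; tune down until it fits in VRAM
    n_ctx=4096,
)

out = llm("Q: Why is partial offloading slow?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```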