
There is definitely a demand for a 30B model (aka a model that will comfortably fit on 24GB GPUs (or 32GB of system RAM) and squeeze into 16GB).


Agreed. I would love to have something with the FLOPs of a 3070/4060 and 80 gigs of even slower VRAM (not necessarily HBM/GDDR6x) that can run the XXB models.

I would expect a 40GB GPU to run 35B models.

There is a rumor about a 48GB version of the RTX 3090 replacing the original Titan RTX at a similar price. That would make sense for deep learning workloads, as 24GB is already too small for attention-based models.

A 34B model is probably about the largest you can run on a consumer GPU with 24GB VRAM. 70B will require A100s or a cloud host. 13B models are everywhere already. I'm sure this was a very deliberate choice - let people play with the 13B model locally to whet their appetite, and then they can pay to run the 70B model on Azure.
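A rough back-of-envelope check for claims like these (a sketch; the 1.2x overhead factor for KV cache, activations and runtime buffers is an assumption, not a measured number):

```python
# Weights take roughly params * bits_per_weight / 8 bytes; everything else
# (KV cache, activations, runtime buffers) is folded into a guessed overhead.
def est_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    return params_billion * bits_per_weight / 8 * overhead

for params in (7, 13, 34, 70):
    print(f"{params}B @ 4-bit: ~{est_vram_gb(params, 4):.0f} GB, "
          f"@ 16-bit: ~{est_vram_gb(params, 16):.0f} GB")
```

By that estimate a 34B model at ~4 bits per weight lands around 20 GB, which is why 24GB cards are the practical ceiling for it, while 70B at 4-bit is around 42 GB and needs multiple cards or a data center GPU.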

I would very, very gladly pay for a GPU which took normal DIMMs, and let me get up to 256GB.

I can buy a 32GB DIMM for <$50. Eight of them would do it for $400.

YES, I know the performance hit, but I'm not limited so much by performance as by which models fit. That's true of a lot of other people too.

The only way I can see this happening is if Intel or AMD get serious about AI. That would seriously undercut Nvidia's business.


Even 32GB would be great for a gaming card; any more and you're never seeing it on sale, as it will be bought by the truckload for AI, so of course they're not gonna balloon the VRAM. I suspect we'd still be at 16GB, but they launched the 3090 in Sep 2020 with 24GB, before all this craze really started, and lowering it is bad optics now.

Assuming you want to maintain full bandwidth.

Which I don't care too much about.

However, even 16->24GB is a big step, since a lot of the models are developed for 3090/4090-class hardware. 36GB would place it close to the class of the fancy 40GB data center cards.

If Intel decided to push VRAM, it would definitely have a market. Critically, a lot of folks would also be incentivized to make their software compatible, since it would be the cheapest way to run models.


A single 3090, or any 24GB GPU. Just barely.

Yi 34B is a much better fit. I can cram 75K context onto 24GB without brutalizing the model with <3bpw quantization, like you have to do with 70B for 4K context.
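For context on where that VRAM goes: with a GQA model the KV cache grows linearly with context length, at 2 × layers × KV heads × head dim bytes per token. A sketch using approximate Yi-34B shape values (the layer/head/head-dim numbers below are assumptions from memory, as is the 8-bit cache option):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
def kv_cache_gb(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

print(kv_cache_gb(75_000))                    # ~18 GB at FP16
print(kv_cache_gb(75_000, bytes_per_elem=1))  # ~9 GB with an 8-bit cache
```

Whatever is left after the quantized weights is the budget that bounds how far the context can be pushed.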


Presumably the 32GB of VRAM is what makes it compelling, as you could cram some fairly substantial AI models on there.

I would also love to see a consumer GPU with a lot of VRAM. I think 128 GB would be too expensive BUT I'm hoping the Nvidia 50 series comes with 32+ GB VRAM for at least the top tier cards of the family (today they max out at 24 GB VRAM in the 3090 and 4090.)

16GB of VRAM can run the 7B for sure. I'm not sure what the most cutting-edge memory optimization is, but the 15B is going to be pretty tight; I'm not sure that'll fit with what I know of, at least. I've got it working at a bit over 20GB of VRAM, I think, at 8-bit.

If you can't fit it all in VRAM you can still run it, but it'll be slooooow; at least that's been my experience with the 30B.
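For reference, that 8-bit setup is roughly what you get from transformers with a bitsandbytes config (a sketch; the model ID is just a placeholder for a ~15B checkpoint, and real memory use also depends on context length):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-15b-model"  # placeholder; substitute the checkpoint you're using

# 8-bit weights are ~1 byte/parameter, so ~15 GB of weights for a 15B model,
# plus a few GB of activations and KV cache on top.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # layers that don't fit spill to CPU RAM -- much slower
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```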


The data buffer size shown by Georgi here is 96GB, plus there is other overhead; it states the recommended max working set size for this context is 147GB, so no, Falcon 180B in Q4 as shown wouldn't fit on 4x 24GB 3090s (96GB VRAM).

I'm also in the quad-3090 build idea stage and bought 2 with the intention to go up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I purchased a specific model that I thought would be plentiful, as I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.

Based on the performance of Llama2 70B, I think 96GB of VRAM and the CUDA core count x bandwidth of four 3090s will hit a golden zone for price-performance in a deep learning rig that can do a bit of finetuning on top of just inference.

Unless A6000 prices (or A100 prices) start plummeting.

My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.
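A rough way to sanity-check that price-performance zone: single-stream decoding is mostly memory-bandwidth-bound, so the tokens/s ceiling is roughly bandwidth divided by the bytes each token has to stream through. A naive sketch (it ignores compute, KV cache reads and inter-GPU transfer overhead; 936 GB/s is the 3090's spec bandwidth):

```python
def decode_ceiling_tok_s(weight_gb, per_gpu_bandwidth_gbs=936):
    # With layers split evenly across identical GPUs, each token still reads
    # all the weights once, one GPU at a time, so time per token is roughly
    # total_weight_bytes / per-GPU bandwidth.
    return per_gpu_bandwidth_gbs / weight_gb

print(decode_ceiling_tok_s(70 * 4 / 8))  # 70B @ 4-bit: ~27 tok/s upper bound
print(decode_ceiling_tok_s(70 * 8 / 8))  # 70B @ 8-bit: ~13 tok/s upper bound
```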


The problem with that is that, currently, the available memory scales with the class of GPU... and very large language models need 160-320GB of VRAM. So there sadly isn't anything out there you can load a model this large on, except a rack of 8x+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot, but I really wish there were a card out there with a 3090-sized die but 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB, and still fitting on a desk and drawing normal power, would be great.


With QLoRA, 7B models can be fine-tuned in 24 GB of VRAM. I have seen a few folks doing it locally in Discord servers I am part of.
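For anyone curious what that looks like in practice, here is roughly the shape of a QLoRA setup with transformers + peft + bitsandbytes (a sketch; the model ID is a placeholder and the LoRA hyperparameters and target module names are assumptions, not anyone's exact recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "your-org/your-7b-model"  # placeholder 7B base checkpoint

# NF4 4-bit base weights (~3.5 GB for 7B); only the small LoRA adapters are
# trained, so the whole fine-tune stays well inside 24 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed Llama-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```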

AMD get it together, pretty please. Nobody wants to live in a world where only NVidia produces discrete GPUs!

That 32GB seems too high for the cost-constrained consumer market though - maybe they will have a leaner variant for desktops/gaming.


For people wanting to run it locally, you can fit the 7b model (just) into a 24GB VRAM GPU (e.g. 3090/4090). The 3b model appears to be much more reasonable, but I would say the output is.... of limited quality based on the few tests I've run thus far.

Much larger L2 cache on the GPU.

The 30xx cards need a wide bus to feed them, but with the large L2 cache on the GPU, you can get away with a smaller bus width (and faster memory on top; things shifted in the last couple of years and GDDR6(X) memory is even faster now than it was).

Some of this also has to do with what kind of memory configurations you want to offer, as they are tied to the bus width. So if you widen the bus, you are looking at using more memory chips and having more memory. This pushes the price point upward, and you want it to tier somewhere in the middle of it all.

One "mistake" on the 30xx (Ampere) series were that the 3080 was much too powerful for its price point, and you couldn't really cram too much more power out above it due to its memory config. With this change, you can introduce a 4080Ti in between the 4080 16Gb and the 4090 as a mid-seasonal upgrade, and widen the bus a bit to satisfy it.


The 7B model runs on a CUDA-compatible card with 16GB of VRAM (assuming your card has 16-bit float support).

I only got the 30b model running on a 4 x Nvidia A40 setup though.
