
>And they are able to be linked together such that they all share the same memory via NVLink. This makes them scalable for processing the large data and holding the models for the larger scale LLMs and other NN/ML based models.

GPUs connected with NVLink do not exactly share memory; they don't look like a single logical GPU. One GPU can issue loads or stores to a different GPU's memory using "GPUDirect Peer-To-Peer", but you cannot have a single buffer or a single kernel that spans multiple GPUs. That is perhaps easier to use and more powerful than the older approach of explicit device-to-device copies, but it's a far cry from the way multiple CPU sockets "just work". Even if you could treat the system as one big GPU, you wouldn't want to: performance takes a serious hit if you constantly access off-device memory.
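
To make that concrete, here's a rough CUDA sketch (my own, assuming a two-GPU box where devices 0 and 1 support peer access): the buffer is allocated entirely on GPU 1, and a kernel launched on GPU 0 stores into it directly once peer access is enabled. Note that neither the allocation nor the kernel launch spans both GPUs.

    // Rough sketch, not from NVidia's docs. Assumes devices 0 and 1 exist
    // and support peer access between each other.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch_remote(float *remote, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) remote[i] = 1.0f;  // store travels over NVLink or PCIe to the peer GPU
    }

    int main() {
        int n = 1 << 20, can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
        if (!can_access) { printf("no P2P between devices 0 and 1\n"); return 1; }

        float *buf_on_gpu1 = nullptr;
        cudaSetDevice(1);
        cudaMalloc(&buf_on_gpu1, n * sizeof(float));   // allocation owned entirely by GPU 1

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);              // map GPU 1's memory into GPU 0's address space
        touch_remote<<<(n + 255) / 256, 256>>>(buf_on_gpu1, n);  // kernel runs only on GPU 0
        cudaDeviceSynchronize();
        return 0;
    }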

NVLink doesn't open up any functionality that isn't available over PCIe, as far as I know. It's "merely" a performance improvement. The peer-to-peer technology still works without NVLink.
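
As far as I can tell, the runtime reflects this: the same peer-access calls are used either way, and the interconnect mostly shows up as a capability/performance query. Another small sketch (also mine; devices 0 and 1 are again an assumption):

    // Query what the runtime reports about the 0 -> 1 link. The peer-access
    // API is identical whether the link is NVLink or plain PCIe.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int supported = 0, rank = 0;
        cudaDeviceGetP2PAttribute(&supported, cudaDevP2PAttrAccessSupported, 0, 1);
        cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1);
        // The rank is just a relative hint about link quality between the pair;
        // it doesn't change which calls are available.
        printf("P2P 0->1: supported=%d, performance rank=%d\n", supported, rank);
        return 0;
    }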

NVidia's docs are, as always, confusing at best. There are several similarly-named technologies. The main documentation page just says "email us for more info". The best online documentation I've found is in some random slides.

https://developer.nvidia.com/gpudirect

https://developer.download.nvidia.com/CUDA/training/cuda_web...




Interesting. So that would mean you'd still need a 40 or 80 GB card to run the larger models (30B, 70B, or 8x7B LLMs) and to train them.

Or would it be possible to split the model layers between the cards, like you can between RAM and VRAM? I suppose in that case each card would evaluate the layers held in its own memory and then pass the results on to the other card(s) as needed.
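
Something like this, perhaps (a rough CUDA sketch of that split; layer_forward is a made-up stand-in for real layer kernels, and the two-way split and buffer sizes are assumptions): each half of the model stays resident on its own card, and only the activations at the split point cross the device boundary.

    // Hypothetical layer-splitting sketch: first half of the layers on GPU 0,
    // second half on GPU 1, intermediate activations copied between them.
    #include <cuda_runtime.h>

    __global__ void layer_forward(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 0.5f + 1.0f;  // placeholder for a real layer
    }

    int main() {
        const int n = 1 << 20;  // activation size at the split point (assumed)
        float *in0, *out0, *in1, *out1;

        cudaSetDevice(0);                      // first half of the layers
        cudaMalloc(&in0, n * sizeof(float));
        cudaMalloc(&out0, n * sizeof(float));
        cudaMemset(in0, 0, n * sizeof(float));
        layer_forward<<<(n + 255) / 256, 256>>>(in0, out0, n);

        cudaSetDevice(1);                      // second half of the layers
        cudaMalloc(&in1, n * sizeof(float));
        cudaMalloc(&out1, n * sizeof(float));

        // Hand the intermediate activations to the other card; the weights and
        // everything else never leave their home GPU.
        cudaMemcpyPeer(in1, 1, out0, 0, n * sizeof(float));

        layer_forward<<<(n + 255) / 256, 256>>>(in1, out1, n);
        cudaDeviceSynchronize();
        return 0;
    }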


