I did a very similar analysis with LLaMA 65B, which was trained on 1.4T tokens (~5.6T characters, assuming an average token length of 4 characters), compared against a quantized model size of ~38GB.
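For reference, here's the same back-of-the-envelope in Python; the ~4.6 bits/weight for the quantized file and the 4 characters/token average are assumptions rather than measured values:

```python
# Back-of-the-envelope: quantized model size vs. raw training text size (LLaMA 65B).
params = 65e9                 # LLaMA 65B parameter count
train_tokens = 1.4e12         # tokens seen in training (per the LLaMA paper)
chars_per_token = 4           # assumed average token length
bits_per_weight = 4.6         # assumed effective bits/weight for a 4-bit quant file

train_chars = train_tokens * chars_per_token   # ~5.6e12 characters, i.e. ~5.6 TB of text
quant_bytes = params * bits_per_weight / 8     # ~37 GB quantized
fp16_bytes = params * 2                        # ~130 GB at fp16

print(f"quantized model / training text: {quant_bytes / train_chars:.1%}")  # ~0.7%
print(f"fp16 model      / training text: {fp16_bytes / train_chars:.1%}")   # ~2.3%
```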
I am currently testing the limits and got Llama 3 70B in a 2-bit-quantized form to run on my laptop, which has very low specs: an RTX 3080 with 8GB VRAM (laptop version) and 16GB system RAM. It runs at 1.2 tokens/s, which is a bit slow. The biggest issue, however, is the time it takes for the first token to be printed, which fluctuates between 1.8s and 45s.
I tested the same model on a 4070 with 16GB VRAM (desktop PC version) and 32GB system RAM, and it runs at about 3-4 tokens per second. The 4070 also has the issue of a quite long time to first token; I think it was around 12s in my limited testing.
I'm still trying to figure out how to speed up the time to first token. 4 tokens per second is usable for many cases because that's about reading speed.
There are also 1-bit-quantized 70B models appearing, so there might be ways to make it even a bit faster on consumer GPUs.
I think we are at the very edge of usability here, and I keep testing.
I cannot tell exactly how this strong quantization affects output quality; information about that is mixed and seems to depend on the form of quantization as well.
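For anyone who wants to compare, here's roughly how one can measure time-to-first-token and generation speed with the llama-cpp-python bindings; the GGUF filename and n_gpu_layers are placeholders for whatever quant and VRAM split you're running:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path/settings: point at your own GGUF and tune n_gpu_layers to fit your VRAM.
llm = Llama(model_path="llama-3-70b.Q2_K.gguf", n_ctx=2048, n_gpu_layers=20)

prompt = "Explain KV caches in one paragraph."
start = time.time()
first_token_at = None
n_tokens = 0

# Stream the completion so we can timestamp the first generated chunk separately.
for chunk in llm(prompt, max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.time()
    n_tokens += 1

end = time.time()
print(f"time to first token: {first_token_at - start:.1f}s")
print(f"generation speed: {n_tokens / max(end - first_token_at, 1e-9):.1f} tokens/s")
```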
Sorry, it sounds like you know a lot more than I do about this, and I'd appreciate it if you'd connect the dots. Is your comment a dig at either Snowflake or Llama? Where are you finding the unquantized size of Llama 3 70B? Isn't it extremely rare to do inference with large unquantized models?
The LLaMA paper contradicts this view:
"[...] Although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens."
https://arxiv.org/pdf/2302.13971.pdf
Fewer tokens than Llama 3 (3.3T vs 15T), yet a better outcome. No doubt more information-dense training data. The interesting thing is the use of synthetic data, which they don't talk about.
You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
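For anyone who wants the same numbers without eyeballing console output, the ollama HTTP API reports token counts and timings; a small sketch, assuming the default localhost:11434 endpoint and whichever model tags you've pulled:

```python
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    # Ask the local ollama server for a non-streamed completion and read its timing stats.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    data = r.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for model in ["llama3:8b", "codegemma:7b"]:   # placeholder tags; use whatever you pulled
    print(model, f"{tokens_per_second(model, 'Write a haiku about GPUs.'):.1f} tokens/s")
```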
This copy is wrong -- it's pre-prepped copy for a model run that didn't pan out. Updating to correct number -- already present in the training grid further down in the model card. Bit over 830M tokens for this stage and >1B for all extension stages combined.
Your point still stands. We wanted to get something out ASAP and finetune more later. I believe the giant vocab size of Llama 3 is actually adversarial for finetunes: you need a beefy dataset to even hit every vocab token a single time with a forward and backward pass.
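A rough way to check how much of the vocab a finetuning corpus even touches; the tokenizer checkpoint and the corpus filename below are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder names: swap in your actual tokenizer checkpoint and finetuning text file.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
seen = set()

with open("finetune_corpus.txt", encoding="utf-8") as f:
    for line in f:
        seen.update(tok.encode(line, add_special_tokens=False))

print(f"distinct tokens used: {len(seen)} / {tok.vocab_size} "
      f"({len(seen) / tok.vocab_size:.1%} of the vocab)")
```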
Here is one question I have not seen answered yet:
All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.
Clearly this quantized model still outputs something resembling human language, at the very least.
But I haven't seen anyone show what effect this quantizing has on the quality of the output. If the quality of the output is bad, it's unclear whether that's because the model needs to be finetuned (as Stanford did here), because the quantizing reduced the quality, or both.
If this fine-tuned Stanford model still has excellent output after quantizing it to run on a Raspberry Pi 4GB, that would be awesome!
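For intuition about what the int4 step does, here's a toy numpy sketch of groupwise 4-bit round-tripping; this is a simplification, not the exact ggml/GPTQ scheme, but it shows the kind of error quantization introduces:

```python
import numpy as np

def fake_int4_roundtrip(w: np.ndarray, group_size: int = 32) -> np.ndarray:
    """Quantize weights to 4-bit integers with one scale per group, then dequantize."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range: -8..7
    q = np.clip(np.round(w / scale), -8, 7)              # the values a 4-bit file would store
    return (q * scale).reshape(-1)                       # what the model "sees" at inference

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096 * 32).astype(np.float16).astype(np.float32)
w_hat = fake_int4_roundtrip(w)
print("mean abs error:", np.abs(w - w_hat).mean())
print("relative error:", np.abs(w - w_hat).mean() / np.abs(w).mean())
```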
It's interesting to me that the LLaMA nB models still produce reasonable results after 4-bit quantization of the 32-bit weights. Does this indicate some possibility of reducing the compute required for training?
Theoretically this is an even better size, as it would fit on a 20GB-24GB GPU with more relaxed quantization and much longer context.
Metrics are slightly below 13B, but the theory is that the higher parameter count is more amenable to finetuning. If you search for 22B on huggingface, you can see that frankenllama experiments are ongoing:
That's why I provided the demo link; see the tokens/second. They are running LLaMA 70B at about 260 tokens/s without quantization. This is the fastest LLaMA 2 model.
Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @lmsysorg :))
400B is still training, but already encroaching GPT-4 territory (e.g. 84.8 MMLU vs. 86.5 4Turbo).
Tokenizer: the number of tokens was 4X'd from 32K (Llama 2) -> 128K (Llama 3). With more tokens you can compress sequences more in length; the release cites 15% fewer tokens and better downstream performance.
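To see that compression concretely, you can tokenize the same text with both tokenizers and compare counts; a minimal sketch, assuming you have access to both (gated) checkpoints on Hugging Face and some sample text file:

```python
from transformers import AutoTokenizer

text = open("sample.txt", encoding="utf-8").read()   # any reasonably large text sample

for name in ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok.encode(text, add_special_tokens=False))
    print(f"{name}: {n} tokens")
# The Llama 3 count should come out roughly 15% lower on English text.
```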
Architecture: no major changes from the Llama 2. In Llama 2 only the bigger models used Grouped Query Attention (GQA), but now all models do, including the smallest 8B model. This is a parameter sharing scheme for the keys/values in the Attention, which reduces the size of the KV cache during inference. This is a good, welcome, complexity reducing fix and optimization.
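To put rough numbers on the KV-cache saving, using the commonly reported Llama 3 8B config (32 layers, 32 query heads, 8 KV heads, head dim 128; treat these as assumptions) at the full 8192-token context:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2x for keys and values, fp16 elements
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa      = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=8192)
print(f"MHA KV cache @8192 ctx: {full_mha / 2**30:.1f} GiB")   # ~4 GiB
print(f"GQA KV cache @8192 ctx: {gqa / 2**30:.1f} GiB")        # ~1 GiB, 4x smaller
```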
Sequence length: the maximum number of tokens in the context window was bumped up to 8192 from 4096 (Llama 2) and 2048 (Llama 1). This bump is welcome, but quite small w.r.t. modern standards (e.g. GPT-4 is 128K) and I think many people were hoping for more on this axis. May come as a finetune later (?).
Training data. Llama 2 was trained on 2 trillion tokens; Llama 3 was bumped to a 15T-token training dataset, with a lot of attention paid to quality, 4X more code tokens, and 5% non-en tokens over 30 languages. (5% is fairly low w.r.t. the non-en:en mix, so this is certainly a mostly English model, but it's quite nice that it is > 0.)
Scaling laws. Very notably, 15T is a very very large dataset to train with for a model as "small" as 8B parameters, and this is not normally done and is new and very welcome. The Chinchilla "compute optimal" point for an 8B model would be to train it for ~200B tokens (if you were only interested in getting the most "bang-for-the-buck" w.r.t. model performance at that size). So this is training ~75X beyond that point, which is unusual but personally, I think extremely welcome. Because we all get a very capable model that is very small, easy to work with and inference. Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence. Actually, I really hope people carry forward the trend and start training and releasing even more long-trained, even smaller models.
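The "~75X beyond compute optimal" figure is just this arithmetic, taking Chinchilla's ~20 tokens per parameter rule of thumb as given:

```python
params = 8e9
tokens_per_param = 20                      # Chinchilla's rough rule of thumb
optimal = tokens_per_param * params        # ~160B tokens; the post rounds this to ~200B
trained = 15e12
print(f"compute-optimal tokens: ~{optimal / 1e9:.0f}B")
print(f"overtrained by: ~{trained / 200e9:.0f}x (using the ~200B figure above)")  # ~75x
```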
Systems. Llama 3 is cited as trained with 16K GPUs at observed throughput of 400 TFLOPS. It's not mentioned but I'm assuming these are H100s at fp16, which clock in at 1,979 TFLOPS in NVIDIA marketing materials. But we all know their tiny asterisk (*with sparsity) is doing a lot of work, and really you want to divide this number by 2 to get the real TFLOPS of ~990. Why is sparsity counting as FLOPS? Anyway, focus Andrej. So 400/990 ~= 40% utilization, not too bad at all across that many GPUs! A lot of really solid engineering is required to get here at that scale.
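Spelled out, plus a rough wall-clock estimate using the standard ~6·N·D training-FLOPs approximation (the 6ND rule and the dense-TFLOPS figure are the assumptions here, so take the day count as order-of-magnitude only):

```python
observed_tflops = 400
dense_h100_tflops = 1979 / 2          # strip the sparsity asterisk -> ~990 dense TFLOPS
mfu = observed_tflops / dense_h100_tflops
print(f"utilization: {mfu:.0%}")      # ~40%

# Rough wall-clock for the 70B model: total training FLOPs ~= 6 * params * tokens
params, tokens, gpus = 70e9, 15e12, 16_000
total_flops = 6 * params * tokens
seconds = total_flops / (gpus * observed_tflops * 1e12)
print(f"~{seconds / 86400:.0f} days of training at that throughput")
```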
TLDR: Super welcome, Llama 3 is a very capable looking model release from Meta. Sticking to fundamentals, spending a lot of quality time on solid systems and data work, exploring the limits of long-training models. Also very excited for the 400B model, which could be the first GPT-4 grade open source release. I think many people will ask for more context length.
Personal ask: I think I'm not alone to say that I'd also love much smaller models than 8B, for educational work, and for (unit) testing, and maybe for embedded applications etc. Ideally at ~100M and ~1B scale.
The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.
Here's my original back of the napkin analysis:
https://news.ycombinator.com/item?id=36681440