Has anyone published a similar run of benchmarks with llama2 70B but at different quantization levels? I assume this benchmark is evaluated on the base model run at FP16. How much does it lose quantizing to INT8?
> how everyone is in this mad quantization rush but nobody's putting up benchmarks to show that it works (tinybox is resolutely supporting non quantized LLaMA)
I don't think this is true. llama.cpp has historically been very conscientious about benchmarking perplexity. Here's a detailed chart of baseline FP16 vs the new k-quants: https://github.com/ggerganov/llama.cpp/pull/1684
While most evals aren't currently comparing performance across quantization levels, there are two evals that are:
* llm-jeopardy: https://github.com/aigoopy/llm-jeopardy - You can see that the same Airoboros 65B model goes from a score of 81.62% to 80.00% going from an 8_0 to 5_1 quant, and 5_1 solidly beats out the 33B 8_0, as expected.
Also, GPTQ, SPQR, AWQ, SqueezeLLM all have arXiv papers and every single team is running their own perplexity tests.
Now, that being said, every code base seems to calculate perplexity slightly differently. I've recently been working on decoding them all to allow apples-to-apples comparisons between implementations.
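To make this concrete, here's roughly the common recipe (a minimal HF transformers sketch, not any particular project's implementation; the model id, eval file, window size, and the chunked no-overlap scheme are all assumptions here, and those knobs are exactly where the implementations differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = open("wikitext2_test.txt").read()  # assumed local copy of the eval text
ids = tok(text, return_tensors="pt").input_ids[0]

window = 2048
nlls, n_tokens = [], 0
for start in range(0, ids.numel() - 1, window):
    chunk = ids[start:start + window].unsqueeze(0).to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)        # HF shifts labels internally
    nlls.append(out.loss * (chunk.numel() - 1)) # loss is a mean over predicted tokens
    n_tokens += chunk.numel() - 1

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"perplexity: {ppl.item():.3f}")
```

Window size, stride/overlap, and whether the first tokens of each chunk get scored are the usual sources of disagreement between code bases.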
I've done some experiments here with Llama 13B; in my subjective experience the original fp16 model is significantly better (particularly on coding tasks). There are a bunch of synthetic benchmarks, such as WikiText-2 PPL, and all the whiz-bang quantization schemes seem to score well, but subjectively something is missing.
I've been able to compare 4-bit GPTQ, naive int8, LLM.int8(), fp16, and fp32. LLM.int8() does impressively well, but inference is 4-5x slower than native fp16.
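For anyone who wants an LLM.int8() baseline of their own, the standard route is bitsandbytes via transformers, roughly like this (the model id is a placeholder, not necessarily what was used above; newer transformers versions prefer a BitsAndBytesConfig over the bare flag):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder model id for illustration
tok = AutoTokenizer.from_pretrained(model_id)
# load_in_8bit=True swaps the linear layers for bitsandbytes LLM.int8() kernels
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

prompt = "def fizzbuzz(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```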
Oddly, I recently ran a fork of the model on ONNX Runtime, and I'm convinced the model performed better than PyTorch/transformers; perhaps subtle differences in floating-point behavior between kernels on different hardware significantly influence performance.
The most promising next step in the quantization space IMO has to be fp8: a lot of hardware vendors are adding support, and there are a lot of reasons to believe fp8 will outperform most current quantization schemes [1][2], particularly when combined with quantization-aware training / fine-tuning (I think OpenAI did something similar for GPT-3.5 "turbo").
If anybody is interested I'm currently working on an open source fp8 emulation library for pytorch, hoping to build something equivalent to bitsandbytes. If you are interested in collaborating my email is in my profile.
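To give a flavor of what "emulation" means here: fake the fp8 rounding while keeping tensors in fp32/fp16, something along these lines (a minimal sketch of E4M3-style rounding for normal values only; a real library also has to handle subnormals, NaN encoding, and per-tensor scaling):

```python
import torch

def fake_quant_e4m3(x: torch.Tensor) -> torch.Tensor:
    """Simulate fp8 E4M3 rounding in higher precision (illustrative sketch only)."""
    # E4M3: 4 exponent bits, 3 mantissa bits, max normal value ~448
    x = x.clamp(-448.0, 448.0)
    mant, exp = torch.frexp(x)           # x = mant * 2**exp, mant in [0.5, 1)
    mant = torch.round(mant * 16) / 16   # keep implicit leading bit + 3 mantissa bits
    return torch.ldexp(mant, exp)
```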
Anyone have benchmarks on how the llama 3 8b model performs when quantized to varying degrees? I reckon many people will be running these with llama.cpp or similar.
The performance loss from BiLLM is disastrous. It's basically useless. No one would ever want to use the resulting models. They hide their main results in the appendix: Table 8. page 15. https://arxiv.org/pdf/2402.04291
I won't go over the entire table in detail, but PIQA, BoolQ, HellaSwag, and WinoGrande should be in the mid-to-high 70s for LLaMa2-7B. They drop those to 58, 62, 32, and 51. There are 700M-parameter models that perform much better.
What they should have reported is the effective number of parameters. Does LLaMa2-7B with their quantization method outperform a model that has the same amount of compute but spends it on, say, 16-bit weights? If the answer is no, and it very clearly seems to be, then the method is wholly worthless. Just use a smaller model to begin with.
The BitNet paper is better, but for some reason they only consider very small models. It's the obvious question, and their FAQ doesn't provide a solid answer to it, despite their having all of Microsoft's compute resources. They could easily have run this experiment in the past year; I'm suspicious.
It's weird that not once do they mention or compare their results to the already-available quantization methods. I normally try to give benefit of the doubt, but there's really no way they're not aware that there are already widely used techniques for accomplishing this same thing, so the comparison benchmarks really should be there.
To fill in the gap, here's llama.cpp's comparison chart[0] for the different quantizations available for Llama 1. We can't compare directly with their Llama 2 metrics, but just comparing the percent change in speed and perplexity, MK-1 looks very similar to Q5_1: a small but not insignificant hit to perplexity, and just over a 2x speedup.
If these numbers are accurate, you can download pre-quantized Llama 2 models from Hugging Face that will perform essentially the same as what MK-1 is offering, with the Q5 files here: https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
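If you want to sanity-check that yourself, those GGML files load directly with llama-cpp-python (the file name below is an assumption; use whichever Q5 variant you actually downloaded):

```python
from llama_cpp import Llama

# Load a pre-quantized GGML file from the TheBloke repo linked above.
llm = Llama(model_path="./llama-2-13b.ggmlv3.q5_1.bin", n_ctx=2048)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])
```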
4-bit quantized llama gets borderline-acceptable performance on an iPad CPU[1], and there's no reason they can't further optimise CPU perf for this task. IIUC, LLM inference is mostly constrained by memory size and bandwidth (hence the aggressive quantization), rather than raw compute.
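A rough way to see the bandwidth ceiling (the numbers below are illustrative assumptions, not measurements): every generated token has to stream the whole weight set through memory once.

```python
# Back-of-the-envelope decode-speed ceiling for a 7B model on a tablet-class SoC.
params = 7e9
bytes_per_param = 0.5          # ~4-bit quantization
mem_bandwidth = 60e9           # assumed LPDDR bandwidth in bytes/s (illustrative)

tokens_per_sec = mem_bandwidth / (params * bytes_per_param)
print(f"upper bound ≈ {tokens_per_sec:.0f} tokens/s")   # ≈ 17 tokens/s
```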
There are already papers on it, and there is 2-bit quant in llama.cpp.
But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... for now.
There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.
Side question, but how are these models benchmarked, and how is this subfield evolving these days?
I have seen many papers relying on standard student-test performance, but those don't seem very accurate, since LLaMA-based models score almost as well as ChatGPT (3/3.5) despite apparently being an order of magnitude worse in practice.
This code runs Llama2 quantized and unquantized in a roughly minimal way: https://github.com/srush/llama2.rs (though extracting the quantized 70B weights takes a lot of RAM). I'm running the 13B quantized model on ~10-11GB of CPU memory.
Is it compared anywhere with a smaller model fine-tuned on the same thing? Edit: I didn't see any comparisons skimming the paper. More specifically, fine-tuned smaller models can often be pretty good. I'd want to see how 3B and 1B etc. Llama models fine-tuned on the same dataset perform. Is sparsity the key here, or is it just fewer parameters?
The quantization part was interesting - they also quantized activations and have some adaptive quantization that accommodates outliers.
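For anyone unfamiliar with the general idea, outlier-aware activation quantization usually looks something like this (a generic illustration in the spirit of LLM.int8()'s mixed-precision decomposition, not the paper's actual scheme):

```python
import torch

def mixed_quant(x: torch.Tensor, threshold: float = 6.0):
    """Quantize most activation columns to int8, keep outlier columns in fp16 (sketch)."""
    outlier_cols = x.abs().amax(dim=0) > threshold      # columns with large-magnitude values
    dense = x[:, ~outlier_cols]
    scale = dense.abs().max().clamp(min=1e-8) / 127.0   # per-tensor symmetric scale
    q = torch.round(dense / scale).clamp(-127, 127).to(torch.int8)
    return q, scale, x[:, outlier_cols].half(), outlier_cols
```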
I did a very similar analysis with Llama 65B, which was trained on 1.4T tokens (roughly 5.6T characters, assuming a token length of 4 characters), comparing with a quantized model size of ~38GB.
The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.
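For anyone wanting to reproduce the arithmetic, it's roughly this (treating the figures above as the assumptions):

```python
# Back-of-the-envelope: model size as a fraction of the uncompressed training text.
train_bytes = 1.4e12 * 4        # ~1.4T tokens * ~4 chars/token ≈ 5.6e12 bytes
q4_model_bytes = 38e9           # ~38 GB 4-bit quantized 65B model
fp16_model_bytes = 65e9 * 2     # ~130 GB at fp16

print(f"4-bit model / training text ≈ {q4_model_bytes / train_bytes:.1%}")    # ~0.7%
print(f"fp16 model / training text  ≈ {fp16_model_bytes / train_bytes:.1%}")  # ~2.3%, rounded to ~3%
```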
I've used both the 7B and 13B instruction-tuned llama weights (quantized using the llama.cpp scripts). Either I am doing something wrong, or these two models are nowhere near the level of ChatGPT. Many times they return something totally irrelevant to my question, stop responding, switch to a different language, or otherwise return the wrong answer. ChatGPT does none of this (other than sometimes returning the wrong answer due to hallucination...).
Reading through the README and issues on the llama.cpp project, there is some speculation that there is a bug in the quantization, or possibly a bug in the inference (less likely I think).
I hope this is true, and that once fixed the models can perform up to or past the ChatGPT level. If it's not true and these models are performing correctly, then either the metrics used to compare them to GPT are garbage and don't capture real-world uses, or the instruction tuning done by the Stanford team is not up to par.
Careful though — we need to evaluate llama on its own merits. It’s easy to mess up the quantization in subtle ways, then conclude that the outputs aren’t great. So if you’re seeing poor results vs gpt-3, hold off judgement till people have had time to really make sure the quantized models are >97% the effectiveness of the original weights.
That said, this is awesome — please share some outputs! What’s it like?
That's why I provided the demo link; see the tokens/second. They are running LLaMA 2 70B at about 260 tokens/s without quantization. This is the fastest LLaMA 2 model.
And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b