Has anyone published a similar run of benchmarks with llama2 70B but at different quantization levels? I assume this benchmark is evaluated on the base model run at FP16. How much does it lose quantizing to INT8?
Anyone have benchmarks on how the llama 3 8b model performs when quantized to varying degrees? I reckon many people will be running these with llama.cpp or similar.
That's why I provided the demo link; see the tokens/second. They are running LLaMA 70B at about 260 T/s without quantization. This is the fastest LLaMA 2 model.
I've done some experiments here with Llama 13B. In my subjective experience the original fp16 model is significantly better (particularly on coding tasks). There are a bunch of synthetic benchmarks such as wikitext2 PPL, and all the whiz-bang quantization schemes seem to score well, but subjectively something is missing.
I've been able to compare 4 bit GPTQ, naive int8, LLM.int8, fp16, and fp32. LLM.int8 does impressively well but inference is 4-5x slower than native fp16.
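If anyone wants to reproduce that kind of comparison, here's a minimal sketch of how I'd time fp16 vs LLM.int8 with transformers + bitsandbytes (the model path is a placeholder, and you'd swap in your own prompt):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/llama-13b"  # placeholder: local checkpoint or HF repo id
tok = AutoTokenizer.from_pretrained(model_id)

def load(quantized: bool):
    if quantized:
        # LLM.int8: 8-bit weights with fp16 outlier handling via bitsandbytes
        cfg = BitsAndBytesConfig(load_in_8bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=cfg, device_map="auto")
    return AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = tok("def fibonacci(n):", return_tensors="pt")

for name, quantized in [("fp16", False), ("LLM.int8", True)]:
    model = load(quantized)
    inputs = {k: v.to(model.device) for k, v in prompt.items()}
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    dt = time.time() - t0
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{name}: {n_new / dt:.1f} tok/s")
    print(tok.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```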
Oddly, I recently ran a fork of the model on the ONNX runtime, and I'm convinced that the model performed better than pytorch/transformers; perhaps subtle differences in floating point behavior etc. between kernels on different hardware significantly influence performance.
The most promising next step in the quantization space IMO has to be fp8: a lot of hardware vendors are adding support, and there are a lot of reasons to believe fp8 will outperform most current quantization schemes [1][2], particularly when combined with quantization-aware training / fine-tuning (I think OpenAI did something similar for GPT-3.5 "turbo").
If anybody is interested, I'm currently working on an open-source fp8 emulation library for PyTorch, hoping to build something equivalent to bitsandbytes. If you'd like to collaborate, my email is in my profile.
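To give a rough idea of the direction (not the library itself, just a sketch): on PyTorch builds that expose torch.float8_e4m3fn you can already fake-quantize weights to E4M3 with a cast round-trip. The per-tensor scaling below is a placeholder choice:

```python
import torch

def fake_quant_fp8(w: torch.Tensor) -> torch.Tensor:
    """Emulate fp8 (E4M3) weight quantization by round-tripping through
    torch.float8_e4m3fn, with a simple per-tensor scale so the weight
    range fits inside E4M3's dynamic range (max ~448)."""
    scale = w.abs().max().clamp(min=1e-12) / 448.0
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)   # rounds to the nearest fp8 value
    return w_fp8.to(w.dtype) * scale              # dequantize back for emulation

def fp8_emulate_linears(model: torch.nn.Module) -> None:
    """Apply the fake quantization in place to every linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            with torch.no_grad():
                module.weight.copy_(fake_quant_fp8(module.weight))
```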
While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised me and others[1] with how well it performs even in the 2-3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (KV cache uncompressed, btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd
For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.
This code runs Llama2 quantized and unquantized in a roughly minimal way: https://github.com/srush/llama2.rs (though extracting the quantized 70B weights takes a lot of RAM). I'm running the 13B quantized model on ~10-11GB of CPU memory.
> how everyone is in this mad quantization rush but nobody's putting up benchmarks to show that it works (tinybox is resolutely supporting non quantized LLaMA)
I don't think this is true. llama.cpp has historically been very conscientious about benchmarking perplexity. Here's a detailed chart of baseline FP16 vs the new k-quants: https://github.com/ggerganov/llama.cpp/pull/1684
While most evals don't currently compare performance across quantized models, there are two evals that do:
* llm-jeopardy: https://github.com/aigoopy/llm-jeopardy - You can see that the same Airoboros 65B model goes from a score of 81.62% to 80.00% going from an 8_0 to 5_1 quant, and 5_1 solidly beats out the 33B 8_0, as expected.
Also, GPTQ, SPQR, AWQ, SqueezeLLM all have arXiv papers and every single team is running their own perplexity tests.
Now, that being said, every code base seems to be calculating perplexity slightly differently. I've recently been working on decoding them all for apples-to-apples comparisons between implementations.
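For reference, the recipe most implementations approximate is a chunked negative log-likelihood over wikitext-2; here's a rough sketch with transformers, where the checkpoint path, context length, and local text file are all placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-2-7b"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

text = open("wikitext-2-test.txt").read()  # assumed local copy of the eval split
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

ctx = 2048                                 # evaluation window; llama.cpp's default differs
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for i in range(0, ids.size(1), ctx):
        chunk = ids[:, i : i + ctx]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)   # HF shifts the labels internally
        n = chunk.size(1) - 1              # number of predicted tokens in this chunk
        nll_sum += out.loss.item() * n
        n_tokens += n

print(f"perplexity: {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.3f}")
```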
It's weird that they never mention or compare their results to the already-available quantization methods. I normally try to give the benefit of the doubt, but there's really no way they're not aware that there are already widely used techniques for accomplishing the same thing, so the comparison benchmarks really should be there.
To fill in the gap, here's llama.cpp's comparison chart[0] for the different quantizations available for Llama 1. We can't compare directly with their Llama 2 metrics, but just comparing the percent change in speed and perplexity, MK-1 looks very similar to Q5_1. There's a small but not insignificant hit to perplexity, and a just over 2x speedup.
If these numbers are accurate, you can download pre-quantized Llama 2 models from Hugging Face that will perform essentially the same as what MK-1 is offering, with the Q5 files here: https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
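If you haven't tried it, running one of those files locally is only a few lines with the llama-cpp-python bindings (the filename below is a placeholder, and note that newer llama.cpp builds expect GGUF rather than GGML, so grab whichever format your version supports):

```python
from llama_cpp import Llama

# Placeholder path: substitute whichever q5_1 file you downloaded
llm = Llama(model_path="./llama-2-13b.q5_1.bin", n_ctx=2048)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])
```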
You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
The performance loss from BiLLM is disastrous. It's basically useless; no one would ever want to use the resulting models. They hide their main results in the appendix: Table 8, page 15. https://arxiv.org/pdf/2402.04291
I won't go over the entire table in detail, but PIQA, BoolQ, HellaSwag, and WinoGrande should be in the mid-to-high 70s for LLaMa2-7B. They drop those to 58, 62, 32, and 51. There are 700M-parameter models that perform much better.
What they should have reported is the effective number of parameters. Does LLaMa2-7B with their quantization method outperform a model that has the same amount of computation but uses that compute with, say, 16-bit weights? If the answer is no, and it seems like it very clearly is, then the method is wholly worthless. Just use a smaller model to begin with.
The BitNet paper is better, but for some reason they only consider very small models. It's the obvious question, and their FAQ doesn't provide a solid answer to it, despite them having all of MS's compute resources. They could have easily run this experiment in the past year; I'm suspicious.
Why is nobody commenting about the quality of these models?
I totally understand that quantization decreases quality and capabilities a bit, but I haven't seen anybody verifying the claim that LLaMA 13B > GPT-3. I was expecting LLaMA 65B to be as coherent as GPT-3, but LLaMA 65B (when run quantized) seems to think 2012 is in the future.
Maybe. Goliath 120B took two different llama variants and interwove the layers. Surprisingly Goliath 120B quantized to 2bit is outperforming llama 70B 4bit in many benchmarks.
I suspect the community will start creating lower precision/quantized versions of the model very quickly. LLaMa 30b quantized to 4 bits is runnable on a 3090/4090.
I am currently testing the limits and got Llama 3 70B in a 2-bit-quantized form to run on my laptop with very low specs: an RTX 3080 with 8GB VRAM (laptop version) and 16GB system RAM. It runs at 1.2 tokens/s, which is a bit slow. The biggest issue, however, is the time it takes for the first token to be printed, which fluctuates between 1.8s and 45s.
I tested the same model on a 4070 with 16GB VRAM (desktop PC version) and 32GB system RAM and it runs at about 3-4 tokens per second. The 4070 also has the issue of quite a long time to first token; I think it was around 12s in my limited testing.
I'm still trying to find out how to speed up the time to first token. 4 tokens a second is usable for many cases because that's about reading speed.
There are also 1-bit-quantized 70B models appearing, so there might be ways to make it even a bit faster on consumer GPUs.
I think we are at the bare edge of usability here and I keep testing.
I can't tell exactly how this strong quantization affects output quality; information about that is mixed and seems to depend on the form of quantization as well.
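For what it's worth, the knobs I've been playing with are the number of layers offloaded to the GPU and the prompt batch size; here's roughly what my llama-cpp-python setup looks like (the filename and the specific values are placeholders you'd tune for 8GB of VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-70b-instruct.IQ2_XS.gguf",  # placeholder filename
    n_ctx=4096,
    n_gpu_layers=20,   # offload as many layers as fit in 8GB VRAM; the rest stay on CPU
    n_batch=512,       # larger batch speeds up prompt processing (time to first token)
    n_threads=8,       # CPU threads for the layers left in system RAM
)

out = llm("Summarize the plot of Hamlet in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```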
Note that quantized versions of Llama 3 70B can be run on CPU on a much cheaper server. I'm personally using it via llama.cpp on a bare-metal 6-core Xeon with 128GB RAM for ~50 euro monthly.
There are benchmarks in the original LLaMA paper[1]. Specifically, on page 4, LLaMA 13B seems to beat GPT-3 on the BoolQ, HellaSwag, WinoGrande, ARC-e and ARC-c benchmarks (not by much, though). The examples you've seen are likely based on some form of quantisation / a poor prompt that degrades output. My understanding is that the only quantisation that doesn't seem to hurt the output is llm.int8 by Tim Dettmers. You should be able to run LLaMA 13B (8-bit quantised) on a 3090 or 4090 consumer-grade GPU as of now. Also, you'd need a prompt such as LLaMA precise[2] in order to get ChatGPT-like output.