
Thanks for the reply.

I am currently testing the limits and got Llama 3 70B in a 2-bit-quantized form to run on my laptop with fairly low specs: an RTX 3080 (laptop version) with 8GB VRAM and 16GB system RAM. It runs at 1.2 tokens/s, which is a bit slow. The biggest issue, however, is the time it takes for the first token to be printed, which fluctuates between 1.8s and 45s.
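
In case it helps anyone reproduce this, here is roughly what the setup looks like as a minimal llama-cpp-python sketch. The model filename and the number of offloaded layers are placeholders, not my exact values; you have to tune n_gpu_layers to whatever fits your VRAM:

    # Minimal sketch with llama-cpp-python; model path and
    # n_gpu_layers are placeholders, adjust for your hardware.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3-70b.IQ2_XS.gguf",  # any 2-bit GGUF quant
        n_gpu_layers=20,  # offload as many layers as fit in 8GB VRAM
        n_ctx=2048,       # smaller context keeps the KV cache small
    )

    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])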

I tested the same model on a 4070 (desktop version) with 16GB VRAM and 32GB system RAM, and it runs at about 3-4 tokens per second. The 4070 also has the issue of a quite long wait before the first token is displayed; I think it was around 12s in my limited testing.

I am still trying to figure out how to speed up the time to first token. 4 tokens per second is usable for many cases because that's about reading speed.
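
To put numbers on it, I time the first token separately from the rest by streaming the output. Again just a sketch, reusing the llm object from the snippet above; each streamed chunk is roughly one token:

    # Sketch: measure time to first token vs. steady-state speed.
    import time

    start = time.perf_counter()
    first = None
    n_chunks = 0
    for chunk in llm("Write a haiku about GPUs.", max_tokens=128, stream=True):
        if first is None:
            first = time.perf_counter() - start  # prompt processing ends here
        n_chunks += 1

    total = time.perf_counter() - start
    print(f"time to first token: {first:.1f}s")
    print(f"~{n_chunks / (total - first):.1f} tokens/s after the first token")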

There are also 1-bit-quantized 70B models appearing, so there might be ways to make it a bit faster still on consumer GPUs.

I think we are right at the edge of usability here, and I will keep testing.

I cannot tell exactly how this strong quantization affects output quality; information about that is mixed and seems to depend on the form of quantization as well.


