You can try llama.cpp with a small model; I'd suggest a 4-bit 7B model. It runs slowly on my M1 MacBook with 16GB of RAM, so even if it does work it will be quite painful.
I run the 30B 4-bit model on my M2 Mac Mini with 32GB and it works okay; the 7B model is blazingly fast on that machine.
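If you'd rather poke at it from Python than the raw CLI, here's a minimal sketch using the llama-cpp-python bindings; the model filename is just a placeholder for whichever 4-bit model file you've downloaded, and the thread count is a guess for an M1/M2 machine:

    from llama_cpp import Llama

    # Load a 4-bit quantized 7B model (path is a placeholder).
    llm = Llama(
        model_path="./models/llama-7b.Q4_0.gguf",
        n_ctx=2048,      # context window
        n_threads=8,     # roughly the number of performance cores
    )

    # Simple completion call; returns an OpenAI-style dict.
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])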
It runs fantastically well on an M2 Mac with llama.cpp; a variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.
It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.
Most places that recommend llama.cpp for Mac fail to mention https://github.com/jankais3r/LLaMA_MPS, which runs unquantized 7B and 13B models directly on the M1/M2 GPU. It's slightly slower (not by a lot) and uses significantly less energy. To me, not having to quantize while also not melting a hole in my lap is a huge win; I wish more people knew about it.
I've been able to run it using llama.cpp on my 2019 iMac with 128GB of RAM. It's not super fast, but it works fine for "send it a prompt, look at the reply a few minutes later", and all it cost me was a few extra sticks of RAM.
> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.
That seems kinda low; are you using Metal GPU acceleration with llama.cpp? I don't have a MacBook, but I've seen some llama.cpp benchmarks suggesting it can reach close to 30 tk/s with GPU acceleration.
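For what it's worth, if you're going through the llama-cpp-python bindings, GPU offload is controlled by n_gpu_layers (the bindings have to be built with Metal enabled for it to do anything); a rough sketch, with the model path as a placeholder:

    from llama_cpp import Llama

    # n_gpu_layers controls how many transformer layers are offloaded to the GPU;
    # -1 (or any number >= the layer count) offloads all of them to Metal.
    llm = Llama(
        model_path="./models/llama-13b.Q4_0.gguf",  # placeholder path
        n_gpu_layers=-1,
        n_ctx=2048,
    )
    print(llm("The quick brown fox", max_tokens=16)["choices"][0]["text"])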
Most of these implementations are not platform-specific. I've been running llama.cpp on x86_64 hardware and the performance is fine. The small models are fast and the quantized 65B model generates about a token per second on a system with dual-channel DDR4, which isn't unusable.
The tough thing to find is something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware, but ordinary desktops don't have the memory bandwidth to be fast. The things that are fast are expensive (e.g. I bet the Epyc 9000 series would do great). Apple doesn't get you there either, because Apple Silicon isn't available with that much RAM, and if it were it wouldn't be affordable (the 96GB MacBook Pro, which isn't enough to run the full model, is >$4000).
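To put rough numbers on that (assuming single-stream generation is memory-bandwidth bound, i.e. every weight gets streamed from RAM once per token):

    params = 65e9

    fp16_gb = params * 2 / 1e9        # ~130 GB unquantized (fp16)
    q4_gb   = params * 4.5 / 8 / 1e9  # ~37 GB at ~4.5 bits/weight (Q4_0-style)

    ddr4_dual_channel = 51.2          # GB/s for dual-channel DDR4-3200

    print(ddr4_dual_channel / q4_gb)    # ~1.4 tokens/s, matching "about a token per second"
    print(ddr4_dual_channel / fp16_gb)  # ~0.4 tokens/s even if the fp16 weights fit in RAM

Which is why memory bandwidth, not core count, is the spec to shop for here.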
The recent llama.cpp update adding Apple Silicon GPU support (at least for Q4_0 models) is really great. I can run pretty big models on my MacBook now with decent performance, and smaller 13B models just fly. Having that much GPU-accessible RAM is perfect.
I run it on a PowerBook. The only problem is that there are no drivers for the Nvidia video card, so there's no acceleration and suspend-to-RAM doesn't work. There's no Flash support either, but I don't think that applies to Intel MacBooks.
[1]: https://github.com/ggerganov/llama.cpp/pull/3362#issuecommen...