
You'll be able to use it with llama.cpp soon [1], so it should run fine on your MacBook, yes.

[1]: https://github.com/ggerganov/llama.cpp/pull/3362#issuecommen...




You can try llama.cpp with a small model; I suggest a 4-bit 7B model. They run slowly on my M1 MacBook with 16GB of RAM, so even if it does work it will be quite painful.
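
If you want a quick way to poke at that, here's a rough sketch using the llama-cpp-python bindings rather than the CLI; the model path, thread count, and context size below are just placeholder assumptions for a 16GB M1, not recommendations:

    # Rough sketch with the llama-cpp-python bindings; assumes
    # `pip install llama-cpp-python` and a locally downloaded 4-bit 7B GGUF file.
    # The path and tuning numbers are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/7b-q4_0.gguf",  # hypothetical path to a 4-bit 7B model
        n_ctx=512,      # small context window to keep memory pressure down
        n_threads=8,    # roughly match the number of performance cores
    )
    out = llm("Q: Why is unified memory useful for LLMs? A:", max_tokens=64)
    print(out["choices"][0]["text"])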

I run the 30B 4-bit model on my M2 Mac Mini with 32GB and it works okay; the 7B model is blazingly fast on that machine.


The unified memory ought to be great for running LLaMA on the GPU on these MacBooks (since it can't run on the Neural Engine currently).

The point of llama.cpp is that most people don't have a GPU with enough RAM; Apple's unified memory ought to solve that.

Some people have it working apparently:

https://github.com/remixer-dec/llama-mps


It runs fantastically well on an M2 Mac with llama.cpp; a variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.

It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.


I ran Llama 2 on my M1 MacBook Pro with 64GB without an issue.

Most places that recommend llama.cpp for Mac fail to mention https://github.com/jankais3r/LLaMA_MPS, which runs unquantized 7B and 13B models on the M1/M2 GPU directly. It's slightly slower (not by a lot) and uses significantly less energy. To me, not having to quantize while not melting a hole in my lap is a huge win; I wish more people knew about it.
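
If anyone wants to sanity-check that their machine can actually use the M1/M2 GPU before trying an MPS-based runner like that, a plain PyTorch check (nothing specific to that repo) looks roughly like this:

    # Quick check that PyTorch sees the Apple GPU via MPS.
    # Plain PyTorch; nothing here is specific to the LLaMA_MPS repo.
    import torch

    if torch.backends.mps.is_available():
        x = torch.ones(3, device="mps")   # allocate a small tensor on the GPU
        print("MPS is available:", (x * 2).tolist())
    else:
        print("MPS is not available on this build/machine.")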

I've been able to run it fine using llama.cpp on my 2019 iMac with 128GB of RAM. It's not super fast, but it works fine for "send it a prompt, look at the reply a few minutes later", and all it cost me was a few extra sticks of RAM.

> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

That seems kinda low. Are you using Metal GPU acceleration with llama.cpp? I don't have a MacBook, but I saw some llama.cpp benchmarks suggesting it can get close to 30 tk/s with GPU acceleration.
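
For reference, with the llama-cpp-python bindings the GPU offload is controlled by the `n_gpu_layers` argument (the CLI has an equivalent option); a minimal sketch, with a placeholder model path:

    # Minimal sketch: enabling Metal/GPU offload with llama-cpp-python.
    # Assumes the package was built with Metal support; the path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/7b-q4_0.gguf",
        n_gpu_layers=-1,   # offload all layers to the GPU; 0 keeps everything on CPU
    )
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])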


Performance of llama.cpp on Apple Silicon M-series:

https://github.com/ggerganov/llama.cpp/discussions/4167


It runs pretty well on my MacBook.

When llama.cpp came out, I was running 13B at 100 ms/token on a base-model 14" MacBook Pro.

Edit: apparently llama.cpp supports running on GPU, so I imagine it's gonna be a bit faster. Maybe a fun evening project for me to get going.


This runs locally on a MacBook.

I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

Worked OK for me with the default context size; 2048, like you see in most examples, was too slow for my taste.
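
In case it helps anyone reproduce that, the context size is just a constructor argument in the llama-cpp-python bindings (the llama.cpp CLI has a `-c` / `--ctx-size` flag for the same thing); a small sketch with a placeholder path:

    # Sketch: using a smaller context window to keep a 7B Q4 model responsive
    # on a low-RAM machine (llama-cpp-python; the path is a placeholder).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/ggml-vic7b-q4_0.gguf",
        n_ctx=512,   # smaller KV cache than the usual 2048-token examples
    )
    print(llm("Once upon a time", max_tokens=32)["choices"][0]["text"])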


Anyone know if this (or any LLM) is worth running locally on an M2 MacBook Air?

I'll give it a shot over the weekend but if anyone knows I'm curious!


Thanks for the tip! Any chance this would run on a 2011 MacBook?

Looking forward to trying it, but I don't have a MacBook. I wonder if it runs on an i7-11800H (8-core, 16-thread CPU) with 64GB of RAM.

Most of these implementations are not platform-specific. I've been running llama.cpp on x86_64 hardware and the performance is fine. The small models are fast and the quantized 65B model generates about a token per second on a system with dual-channel DDR4, which isn't unusable.

The tough thing is finding something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware, but ordinary desktops aren't fast. The things that are fast are expensive (e.g., I bet the Epyc 9000 series would do great). And Apple doesn't get you that either, because Apple Silicon isn't available with that much RAM, and if it were, it wouldn't be affordable (the 96GB MacBook Pro, which isn't enough to run the full model, is >$4000).
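
As a rough sanity check on those numbers: single-stream decoding is mostly memory-bandwidth bound, so tokens per second comes out to roughly bandwidth divided by the bytes of weights streamed per token. A back-of-the-envelope sketch (the sizes and bandwidth figures are ballpark assumptions, not measurements):

    # Back-of-the-envelope: decode speed ~ memory bandwidth / model size,
    # since each generated token streams essentially all the weights once.
    # All numbers below are rough assumptions, not measurements.
    def est_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_gb

    q4_65b = 38.0     # ~65B params at ~4.5 bits/weight, in GB
    fp16_65b = 130.0  # unquantized 65B at 16 bits/weight, in GB

    print(est_tokens_per_sec(q4_65b, 50.0))    # dual-channel DDR4-3200 (~50 GB/s): ~1.3 tok/s
    print(est_tokens_per_sec(fp16_65b, 400.0)) # M2 Max-class unified memory (~400 GB/s): ~3 tok/s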


The GGUF variant looks promising because then I can run it on my MacBook (barely).

The recent update to llama.cpp to support Apple Silicon (at least for Q4_0 models) is really great. I can run pretty big models on my MacBook now with decent performance, and smaller 13B models just fly. The ability to get a lot of GPU-accessible RAM is perfect.
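
A quick way to guess whether a given model and quantization will fit in that GPU-accessible RAM: the weights take roughly parameters × bits-per-weight / 8 bytes, with KV cache and runtime overhead on top. A rough sketch (all numbers approximate):

    # Rough weight-memory estimate: ignores KV cache and runtime overhead,
    # so real usage will be somewhat higher than these figures.
    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * bits_per_weight / 8

    print(weight_gb(13, 4.5))  # 13B at Q4-ish quantization: ~7.3 GB
    print(weight_gb(30, 4.5))  # 30B at Q4-ish quantization: ~16.9 GB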

I run it on a PowerBook. The only problem is there are no drivers for the Nvidia video card, so there's no acceleration, and suspend-to-RAM doesn't work. There's no Flash support either, but I don't think that applies to Intel MacBooks.