You can try llama.cpp with a small model; I'd suggest a 4-bit 7B model. It runs slowly on my M1 MacBook with 16GB of RAM, so even if it does work it will be quite painful.
I run the 30B 4-bit model on my M2 Mac Mini with 32GB and it works okay; the 7B model is blazingly fast on that machine.
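If you'd rather poke at it from Python than the raw CLI, here's a minimal sketch using the llama-cpp-python bindings; the model filename is just a placeholder for whichever 4-bit model file you've downloaded, and the thread count is a guess for an M1/M2 machine:

    from llama_cpp import Llama

    # Load a 4-bit quantized 7B model (path is a placeholder).
    llm = Llama(
        model_path="./models/llama-7b.Q4_0.gguf",
        n_ctx=2048,      # context window
        n_threads=8,     # roughly the number of performance cores
    )

    # Simple completion call; returns an OpenAI-style dict.
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])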
It runs fantastically well on an M2 Mac with llama.cpp; a variety of factors in the Apple hardware make it possible: the ARM fp16 vector intrinsics, the MacBook's AMX co-processor, the unified memory architecture, etc.
It's more than fast enough for my experiments and the laptop doesn't seem to break a sweat.
Most places that recommend llama.cpp for Mac fail to mention https://github.com/jankais3r/LLaMA_MPS, which runs unquantized 7B and 13B models directly on the M1/M2 GPU. It's slightly slower (not by a lot) and uses significantly less energy. To me, not having to quantize while also not melting a hole in my lap is a huge win; I wish more people knew about it.
I've been able to run it using llama.cpp on my 2019 iMac with 128GB of RAM. It's not super fast, but it works fine for "send it a prompt, look at the reply a few minutes later", and all it cost me was a few extra sticks of RAM.
> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.
That seems kinda low; are you using Metal GPU acceleration with llama.cpp? I don't have a MacBook, but I've seen some llama.cpp benchmarks suggesting it can reach close to 30 tk/s with GPU acceleration.
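For what it's worth, if you're going through the llama-cpp-python bindings, GPU offload is controlled by n_gpu_layers (the bindings have to be built with Metal enabled for it to do anything); a rough sketch, with the model path as a placeholder:

    from llama_cpp import Llama

    # n_gpu_layers controls how many transformer layers are offloaded to the GPU;
    # -1 (or any number >= the layer count) offloads all of them to Metal.
    llm = Llama(
        model_path="./models/llama-13b.Q4_0.gguf",  # placeholder path
        n_gpu_layers=-1,
        n_ctx=2048,
    )
    print(llm("The quick brown fox", max_tokens=16)["choices"][0]["text"])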
Most of these implementations are not platform-specific. I've been running llama.cpp on x86_64 hardware and the performance is fine. The small models are fast and the quantized 65B model generates about a token per second on a system with dual-channel DDR4, which isn't unusable.
The tough thing to find is something affordable that will run the unquantized 65B model at an acceptable speed. You can put 128GB of RAM in affordable hardware, but ordinary desktops don't have the memory bandwidth to be fast. The things that are fast are expensive (e.g. I bet the Epyc 9000 series would do great). Apple doesn't get you there either, because Apple Silicon isn't available with that much RAM, and if it were it wouldn't be affordable (the 96GB MacBook Pro, which isn't enough to run the full model, is >$4000).
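To put rough numbers on that (assuming single-stream generation is memory-bandwidth bound, i.e. every weight gets streamed from RAM once per token):

    params = 65e9

    fp16_gb = params * 2 / 1e9        # ~130 GB unquantized (fp16)
    q4_gb   = params * 4.5 / 8 / 1e9  # ~37 GB at ~4.5 bits/weight (Q4_0-style)

    ddr4_dual_channel = 51.2          # GB/s for dual-channel DDR4-3200

    print(ddr4_dual_channel / q4_gb)    # ~1.4 tokens/s, matching "about a token per second"
    print(ddr4_dual_channel / fp16_gb)  # ~0.4 tokens/s even if the fp16 weights fit in RAM

Which is why memory bandwidth, not core count, is the spec to shop for here.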
The recent llama.cpp update adding Apple Silicon GPU support (at least for Q4_0 models) is really great. I can run pretty big models on my MacBook now with decent performance, and smaller 13B models just fly. Having that much GPU-accessible RAM is perfect.
I run it on a PowerBook. The only problem is that there are no drivers for the Nvidia video card, so there's no acceleration and suspend-to-RAM doesn't work. There's no Flash support either, but I don't think that applies to Intel MacBooks.
[1]: https://github.com/ggerganov/llama.cpp/pull/3362#issuecommen...