
Not a big fan of how server-centric the LLM landscape is. I want something that can run locally and doesn't require any special setup. One install + one model import, maximum. Currently, unless I want to clone git repos, install Python dependencies, and buy an Nvidia GPU, I'm stuck waiting for it to become part of https://webllm.mlc.ai/. That's a website, come to think of it, but at least the computation happens locally with minimal fuss.
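For what it's worth, once a model is in WebLLM the "minimal fuss" part really is just a page load plus a model download. Here's a rough sketch of what that looks like, assuming the OpenAI-style API of recent @mlc-ai/web-llm releases (names and the model ID may differ by version; the model ID here is just an example):

    // Runs entirely client-side: weights are fetched into the browser cache
    // and inference happens on your own GPU via WebGPU.
    import { CreateMLCEngine } from "@mlc-ai/web-llm";

    async function main() {
      // Illustrative model ID; pick one from WebLLM's prebuilt model list.
      const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");

      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: "Say hi from my own GPU." }],
      });
      console.log(reply.choices[0].message.content);
    }

    main();

The catch is exactly the one above: you're limited to the models the project has already compiled (or ones you compile yourself with MLC), so you end up waiting for your model of choice to be added.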



You can get llama.cpp or koboldcpp binaries and load a quantized model straight into them, CPU only; no need to install Python or have an Nvidia GPU.

Well, I'd like it to respond in something close to real-time, and since I have a pretty good non-Nvidia GPU, it makes more sense to wait for the WebGPU port.

7 tokens per second on an i5-11400 CPU using llama.cpp - that's pretty close to real time for personal use, I would think. At roughly 0.75 English words per token, that works out to around 300 words per minute, which is about as fast as most people read.

> Not a big fan of how server-centric the LLM landscape is.

That's just not true. You can get ooba[1] running in no time, and it's 100% made for desktop usage. There's also koboldcpp and other solutions also aimed at desktop users. In fact, most LLM communities are dominated by end users who run these models on their desktops to roleplay.

AMD's poor software support for this stuff is orthogonal here.

[1] https://github.com/oobabooga/text-generation-webui

