
Not a big fan of how server-centric the LLM landscape is. I want something that can run locally and doesn't require any special setup. One install + one model import, maximum. Currently, unless I want to clone git repos, install Python dependencies, and buy an Nvidia GPU, I'm stuck waiting for it to become part of https://webllm.mlc.ai/. That's a website, come to think of it, but at least the computation happens locally with minimal fuss.
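For what it's worth, once a model is in WebLLM the "minimal fuss" part really is just a page load plus a model download. Here's a rough sketch of what that looks like, assuming the OpenAI-style API of recent @mlc-ai/web-llm releases (names and the model ID may differ by version; the model ID here is just an example):

    // Runs entirely client-side: weights are fetched into the browser cache
    // and inference happens on your own GPU via WebGPU.
    import { CreateMLCEngine } from "@mlc-ai/web-llm";

    async function main() {
      // Illustrative model ID; pick one from WebLLM's prebuilt model list.
      const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");

      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: "Say hi from my own GPU." }],
      });
      console.log(reply.choices[0].message.content);
    }

    main();

The catch is exactly the one above: you're limited to the models the project has already compiled (or ones you compile yourself with MLC), so you end up waiting for your model of choice to be added.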



You can get llama.cpp or koboldcpp binaries and load a quantized model straight into them, CPU only; no need to install Python or have an Nvidia GPU.

Well, I'd like it to respond in something close to real-time, and since I have a pretty good non-Nvidia GPU, it makes more sense to wait for the WebGPU port.

7 tokens per second on an i5-11400 CPU using llama.cpp - that's pretty close to real time for personal use, I would think. At roughly 0.75 English words per token, that works out to around 300 words per minute, which is about as fast as most people read.

> Not a big fan of how server-centric the LLM landscape is.

That's just not true. You can get ooba[1] running in no time, and it's 100% made for desktop usage. There's also koboldcpp and other solutions also aimed at desktop users. In fact, most LLM communities are dominated by end users who run these models on their desktops to roleplay.

AMD's poor software support for this stuff is orthogonal here.

[1] https://github.com/oobabooga/text-generation-webui

