Not a big fan of how server-centric the LLM landscape is. I want something that can run locally and doesn't require any special setup: one install + one model import, maximum. Currently, unless I want to go clone git repos, install Python dependencies, and buy an Nvidia GPU, I'm stuck waiting for it to become part of https://webllm.mlc.ai/. That's a website, come to think of it, but at least the computation happens locally with minimal fuss.
You can get llama.cpp or koboldcpp binaries and load a quantized model right into them on CPU only; no need to install Python or have an Nvidia GPU.
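To give a rough idea of what "load a quantized model on the CPU" looks like, here's a sketch using the optional llama-cpp-python bindings (the model path is a placeholder for whatever GGUF quant you've downloaded; the standalone llama.cpp/koboldcpp executables do the same thing from a one-line command, so no Python is actually required):

    # pip install llama-cpp-python   (the default build is CPU-only)
    from llama_cpp import Llama

    # Load a quantized GGUF model entirely on the CPU.
    llm = Llama(model_path="./model.gguf", n_ctx=2048, n_threads=8)

    # Run a single completion.
    out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
    print(out["choices"][0]["text"])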
Well, I'd like it to respond in something close to real-time, and since I have a pretty good non-Nvidia GPU, it makes more sense to wait for the WebGPU port.
> Not a big fan of how server-centric the LLM landscape is.
That's just not true. You can get ooba[1] running in no time, and it's 100% made for desktop usage. There are also koboldcpp and other solutions made for desktop users. In fact, most LLM communities are dominated by end users who run these LLMs on their desktops to roleplay.