Vicuna looks pretty good. But as mentioned, commercial use isn't possible.
Why do you think that Llama can be replaced? I mean, it is extremely costly to train that thing. And is there even a clean open-source dataset for the task?
PS: I wouldn't be surprised if Meta, OpenAI, or Google trains something for $1 billion in compute costs or more.
His estimate is that you could train a LLaMA-7B-scale model for around $82,432 and then fine-tune it for a total of less than $85K. But the fine-tuned LLaMA-like models I've seen were, in my opinion, worse even than GPT-3. They were more like GPT-2.5. Not nearly as good as ChatGPT 3.5, and certainly not ChatGPT-beating. Of course, far enough into the future you could certainly run one in the browser for $85K or much less, even $1 if you go far enough out.
I think you can train LLaMA 7B (the model underlying Alpaca) for around $82,000, based on the Meta Research paper about it. Then you can fine-tune it à la Alpaca for a few hundred dollars more.
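For anyone who wants to check that number, the back-of-envelope math is simple. The GPU-hours figure comes from the LLaMA paper; the ~$1/hour A100 rate and the fine-tune budget are my own assumptions, so treat this as a sketch:

    # Back-of-envelope pre-training cost for LLaMA-7B.
    # 82,432 A100-hours is the figure reported in the LLaMA paper;
    # the hourly rate and fine-tune budget are assumptions.
    gpu_hours = 82_432
    usd_per_gpu_hour = 1.00        # assumed on-demand A100 rate
    finetune_cost = 600            # rough Alpaca-style fine-tune budget
    total = gpu_hours * usd_per_gpu_hour + finetune_cost
    print(f"~${total:,.0f}")       # ~$83,032, under the $85K headline figure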
My wilder speculation is that, if you can shrink the model down to 4GB with llama.cpp's 4-bit quantization, it may be possible to run it entirely in the browser (à la Stable Diffusion from the other day).
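The weights-only math roughly works out, at least. This ignores the KV cache and runtime overhead, and the ~4.5 bits/weight figure for llama.cpp's q4 block formats is my assumption:

    # Approximate size of a 4-bit quantized 7B model, weights only
    # (no KV cache or runtime overhead).
    params = 7e9
    bits_per_weight = 4.5          # q4 block formats store a bit over 4 bits/weight
    print(f"~{params * bits_per_weight / 8 / 1e9:.1f} GB")   # ~3.9 GB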
This assessment is based largely on GPT-4 evaluation of the output. In actual use, Vicuna-13B isn't even as good as GPT-3.5, although I do have high hopes for 30B if and when they decide to make that available (or someone else trains it, since the dataset is out).
And don't forget that all the LLaMA-based models only have 2K context size. It's good enough for random chat, but you quickly bump into it for any sort of complicated task solving or writing code. Increasing this to 4K - like GPT-3.5 has - would require significantly more RAM for the same model size.
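To put rough numbers on that: most of the extra RAM is the KV cache, which grows linearly with context length. A quick estimate for a LLaMA-7B-shaped model (32 layers, 4096 hidden size, fp16 cache; these are my assumptions, not measurements):

    # Approximate KV cache size for a LLaMA-7B-shaped model at fp16.
    n_layers, hidden_size, bytes_per_val = 32, 4096, 2
    def kv_cache_gib(context_len):
        # 2x for keys and values, stored per layer per token
        return 2 * n_layers * hidden_size * bytes_per_val * context_len / 2**30
    print(f"2K context: ~{kv_cache_gib(2048):.1f} GiB")   # ~1.0 GiB
    print(f"4K context: ~{kv_cache_gib(4096):.1f} GiB")   # ~2.0 GiB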
Pre-training of a foundational model is the "absurdly expensive" part you're thinking of, but fine-tunes are extremely cheap and are undoubtedly being done constantly. (You can see just how cheap by looking at the papers for Alpaca, Vicuna, Koala, etc.) Prices dropped from about $600 to $10 for the smaller models. Guanaco, using QLoRA, fine-tuned LLaMA-65B in about a day on a single GPU.
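For anyone curious what a QLoRA-style fine-tune actually involves: you load the base model in 4-bit and train small LoRA adapter matrices on top of it, which is why a single GPU is enough. A minimal sketch using the Hugging Face peft/bitsandbytes stack; the base model name, rank, and target modules are illustrative choices, not Guanaco's exact recipe:

    # Minimal QLoRA-style setup: 4-bit quantized base model + LoRA adapters.
    # Model name and hyperparameters are illustrative, not the Guanaco recipe.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "huggyllama/llama-7b",         # placeholder base weights
        quantization_config=bnb,
        device_map="auto",
    )
    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only a tiny fraction of the 7B params train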
Another way to test this empirically, by the way, is to search for all the articles pointing out what ChatGPT (3 or 4) gets wrong. I recently tried those examples when looking for evals, and it gets the large majority right (maybe 80-90% of them are answered correctly now).
"I asked Could you train a ChatGPT-beating model for $85,000 and run it in a browser?"
The answer is still no. Even Llama 65B is not "beating" ChatGPT; it is perhaps almost as good. This 7B model is anything but. The author has already admitted it, but I see he has not changed his previous article regardless.
These Llama-7B derived models are quite limited really. You can probably chat with them a bit (which is cool, no doubt!), but don't expect detailed help with anything in particular.
There is no need for a browser sandbox with the current software as long as you use safetensors model weights.
Maybe I should have phrased that better! I didn't mean that Vicuna was comparable to ChatGPT, just that it's the best Llama-based comparison you can make (since it's at least been conversationally trained).
Is it accurate to say they were trained for less than $600? Wasn't that just the fine-tuning done on top of the already existing LLaMA parameters, which likely cost far more than $600 to train?
Interesting. LLaMA was trained using 16K GPUs, so it would have taken them around a quarter. An hour of GPU use costs $2-$3, so training a custom solution on top of LLaMA should cost anywhere from $15K to $1M. I am trying to get started with this. A few people suggested 2 GPUs were a good start, but I think that would only be good for around 10K training samples.
I have been fooling around with the small 7B LLaMA models. They chat, but they are pretty dumb compared to ChatGPT: they are terser and they confabulate more, even for things that are common knowledge. From asking them questions about current events, it seems the model was trained on data up to early 2020.
I haven't seen much output yet from the biggest 65B-parameter LLaMA model. One can rent cloud VMs on vast.ai that can run it for $1.25 an hour or so, but ChatGPT is $20 a month, so why bother, unless you like the fully uncensored aspect.
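If it's purely about cost, the break-even is easy to work out from those two prices (ignoring setup time and storage):

    # Hours of rented 65B inference per month before it costs more than ChatGPT Plus.
    chatgpt_monthly, vm_hourly = 20.00, 1.25
    print(f"break-even at ~{chatgpt_monthly / vm_hourly:.0f} hours/month")   # ~16 hours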
Preface: I do not consider LLaMA or any of its fine-tuned derivatives to be truly open-source, since they can't be used for commercial purposes and have highly restrictive licenses. If it weren't for the leaked weights, models like Vicuna wouldn't exist.
I think it's somewhat unlikely that a purely open-source model can catch up in the near-term without one or a combination of the following happening:
a) significant funding for the compute resources required, potentially through massive donations by one or more wealthy open-source advocates, with the expectation of nothing in return since it wouldn't be proprietarily valuable
b) breakthroughs in design or architecture that significantly reduce necessary compute resources for initial training and/or fine-tuning
c) experts in cutting-edge AI research (the best of the best) being willing and legally allowed to contribute their unique knowledge to open-source projects, without restriction
d) another company or well-funded organization intentionally and transparently releasing an in-house trained foundational model similar to LLaMA or GPT-4 to the public, along with weights, full source code, and permissive licensing terms that allow commercial use and further modification
I'd say the odds are slim in the near-term future, but honestly it's anyone's guess.
Vicuna in my experience is by far the best fine-tune for LLaMA models [1]. Way, way better than Alpaca or base LLaMA. Recently I tried the Wizard+Vicuna uncensored combination on LLaMA and it was excellent. Can't wait to try that combination on Llama 2.
My favourite leaderboard is this one, as it compares open-source and closed-source models:
Hardly. I've played a lot with the 7, 13, and 30B LLaMAs, as well as the 7 and 13B Alpacas fine-tuned by Stanford. They do not have emergent abilities like being able to generate rhymes or, say, represent a movie plot as emoji. Even OpenAI's old text-davinci-003 (GPT-3.5, but text completion, not the chat models) far outperforms them. That said, I have hopes for a 65B 3-bit quantized, Alpaca-fine-tuned model. We'll see when someone spends the money to do the (more costly) 65B training. The Alpacas are also much more likely to go off the rails and start regurgitating their fine-tuning inputs. Either that, or OpenAI is doing a lot of post-processing on their end to hide the same problems in their LLM.
For now my IRC bots run the Alpaca 7B 4-bit. 13B was not a significant improvement for twice the computation time. But it's best to learn these now, because as soon as OpenAI gets sued for the first time, all the Turing-test-passing older models without the legal butt-covering bolted on will be removed.
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!
These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-parameter model running at acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you then have scalability issues, since A100s are scarce. Also, the "50x cheaper" figure assumes 100% utilization of a GPU, which will never happen in production use cases.
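To make the utilization point concrete, here's the kind of arithmetic behind those per-token claims. Throughput, GPU price, and utilization below are all assumed figures; real numbers depend heavily on batching and the serving stack:

    # Illustrative per-token cost of self-hosting a 7B model.
    # All three inputs are assumptions, not measurements.
    gpu_usd_per_hour = 2.00       # assumed A100 on-demand rate
    tokens_per_second = 1000      # assumed batched throughput for a 7B model
    utilization = 0.2             # fraction of each hour actually serving traffic
    usd_per_1k_tokens = gpu_usd_per_hour / (tokens_per_second * 3600 * utilization) * 1000
    print(f"~${usd_per_1k_tokens:.4f} per 1K tokens at {utilization:.0%} utilization")

At low utilization the per-token advantage over a pay-per-token API shrinks fast, which is the point.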
Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience fine-tuning Llama 2, it was qualitatively more frustrating to get outputs on par with just using ChatGPT.
The value of the ChatGPT API is more dependable scaling and not having to pay for your own infrastructure.
I'm a little out of date (busy few weeks); didn't the Vicuna folks un-housebreak the Llama 2 language model (which is world class) with a slightly less father-knows-best instruct tune?
AFAIK full model training should probably cost a couple of orders of magnitude more?