
Perhaps we could provide feedback to open-source Llama using ChatGPT. The cost of adjusting the model is presumably the hard part?



Llama itself performs comparably to GPT-3.5 (at least the 30B/65B models), but the RLHF of ChatGPT is much better than what the community has produced thus far, and it's tuned to work well without tinkering. There will be open-source models with that level of fine-tuning in the near future, at which point GPT-4 will mainly be superior for things like code that need the best possible cohesion and accuracy.

Not what you're asking, but Vicuna cost merely $300 to fine-tune on top of LLaMA: https://www.marktechpost.com/2023/04/02/meet-vicuna-an-open-...

AFAIK full model training should probably be a couple of orders of magnitude higher?
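
As a rough sanity check using numbers from this thread (Vicuna's $300 above and the ~$82K LLaMA-7B pretraining estimate cited later; both are ballpark figures):

    # Back-of-envelope: pretraining vs. fine-tuning cost (both figures are rough)
    import math

    vicuna_finetune_usd = 300        # reported Vicuna fine-tuning cost
    llama_7b_pretrain_usd = 82_432   # oft-cited LLaMA-7B pretraining estimate

    ratio = llama_7b_pretrain_usd / vicuna_finetune_usd
    print(f"pretraining is ~{ratio:.0f}x the fine-tune cost "
          f"(~{math.log10(ratio):.1f} orders of magnitude)")
    # -> ~275x, i.e. a bit over two orders of magnitude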


> not really competitive with ChatGPT

That's impossible to judge. LLaMA is a foundational model. It has received neither instruction fine-tuning (davinci-003) nor RLHF (ChatGPT). It cannot be compared to these fine-tuned models without, well, fine-tuning.


I have been fooling around with the small 7B LLaMA models. They chat, but they are pretty dumb compared to ChatGPT: they are terser and they confabulate more, even for things that are common knowledge. It seems, from asking it questions about current events, that the model was trained on data up to early 2020.

I haven't seen much output yet from the biggest 65B-parameter LLaMA model. One can rent cloud VMs that can run it for around $1.25 an hour on vast.ai, but ChatGPT is $20 a month, so why bother, unless you like the fully uncensored aspect.
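
The break-even for that comparison is straightforward (rates taken from above; treat them as examples):

    # When does renting a VM for the 65B model beat the $20/month subscription?
    vm_usd_per_hour = 1.25        # vast.ai rate quoted above
    chatgpt_usd_per_month = 20.0

    breakeven_hours = chatgpt_usd_per_month / vm_usd_per_hour
    print(f"renting wins only below {breakeven_hours:.0f} hours of use per month")
    # -> 16 hours/month; heavy users come out ahead with the subscription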


> What we really need is a model that you can run on your own hardware on site

So, LLaMA? It's no ChatGPT, but it can potentially serve this purpose.


How about alignment/ability to answer prompt queries and chain-of-thought reasoning capabilities?

Without the RLHF fine-tuning phase to make it like InstructGPT, I'm assuming it won't be as good as ChatGPT. Is that right?

How hard would it be to fine-tune the 65B model on commodity hardware?

Found answer here:

> Out-of-scope use cases LLaMA is a base, or foundational, model. As such, it should not be used on downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information or generally unhelpful answers.

https://github.com/facebookresearch/llama/blob/main/MODEL_CA...


Not really. I tried several fine-tuned iterations of LLaMA and none of them are even close to ChatGPT.

Does anyone know how to estimate the cost of inference using your own Llama 2 model? This article talks about the cost of fine-tuning it, but not what to expect when running it in production for inference.

In particular, it would be great to know how the inference cost compares to GPT-3.5 Turbo and GPT-4.
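
One rough way to estimate it yourself, assuming you benchmark your own tokens/sec on a rented GPU (the throughput and price numbers below are placeholders, not measurements):

    # Self-hosted cost per 1K tokens = GPU hourly price / tokens generated per hour
    gpu_usd_per_hour = 2.00     # placeholder: hourly rate for your GPU instance
    tokens_per_second = 30.0    # placeholder: measure this for your model/batch size

    usd_per_1k_tokens = gpu_usd_per_hour / (tokens_per_second * 3600) * 1000
    print(f"~${usd_per_1k_tokens:.4f} per 1K tokens at full utilization")
    # Compare against the current API price sheet for GPT-3.5 Turbo / GPT-4;
    # note the self-hosted figure assumes the GPU is busy 100% of the time.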


You're likely right that applying RLHF (plus instruction fine-tuning) to LLaMA 7B would produce results similar to ChatGPT, but I think you're implying that this would be feasible today.

RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.


His estimate is that you could train a LLaMA-7B-scale model for around $82,432 and then fine-tune it for a total of less than $85K. But the fine-tuned LLaMA-like models I saw were, in my opinion, worse even than GPT-3; more like a GPT-2.5. Not nearly as good as ChatGPT 3.5, and certainly not ChatGPT-beating. Of course, far enough in the future you could certainly run one in the browser for $85K or much less, maybe even $1.
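
For context, estimates like that usually come from the standard rule of thumb that training compute is roughly 6 x parameters x tokens. A sketch using LLaMA-7B's published training-token count; the hardware rate, utilization, and hourly price are assumptions:

    # Rough training-cost estimate via the ~6*N*D FLOPs rule of thumb
    params = 7e9              # LLaMA-7B
    tokens = 1.0e12           # ~1T training tokens reported for LLaMA-7B
    flops = 6 * params * tokens            # ~4.2e22 FLOPs total

    a100_peak_flops = 312e12  # assumed: A100 peak bf16 FLOP/s
    utilization = 0.45        # assumed fraction of peak; real runs vary widely
    gpu_hours = flops / (a100_peak_flops * utilization) / 3600

    usd_per_gpu_hour = 1.00   # assumed cloud rate
    print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * usd_per_gpu_hour:,.0f}")
    # lands in the same ~$80-90K ballpark as the estimate above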

"Second, given the large gap between LLaMA and ChatGPT (the latter model is faster, cheaper, and more accurate), "

No it's not. Llama would be cheaper and likely faster if you ran it at the same scale; a few calculations have been done showing that running Llama 65B at 100% utilization is cheaper per token than GPT-3.5 Turbo. Also, comparing them on accuracy isn't a fair comparison: one is a foundational model, the other is an instruction-tuned model. Perhaps compare Llama 65B with GPT-3.


> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-parameter model running at acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you then have scalability issues, since A100s are scarce. Also, the "50x cheaper" figure assumes 100% utilization of a GPU, which will never happen in production use cases.
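
To illustrate how much the utilization assumption matters (the base cost here is made up):

    # Effective self-hosted cost scales inversely with GPU utilization
    cost_at_full_utilization = 0.002   # $/1K tokens if the GPU never idles (made up)

    for utilization in (1.0, 0.5, 0.1):
        effective = cost_at_full_utilization / utilization
        print(f"{utilization:>4.0%} busy -> ${effective:.3f} per 1K tokens")
    # At 10% utilization, a "50x cheaper" headline shrinks to ~5x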

Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience fine-tuning Llama 2, it was qualitatively more frustrating to get outputs on par with just using ChatGPT.

The value of the ChatGPT API is more dependable scaling and not having to pay for infrastructure.


The largest LLaMA model, at ~65 billion parameters, is not quite as large as GPT-3 in size, and probably not quite as well trained, but it's basically in the same class. Even the complete, non-quantized model can be run with llama.cpp on ARM64 and x86_64 CPUs already, assuming you have enough RAM (roughly 128 GB: 65B parameters at 2 bytes each is ~130 GB).

I assume you are referring to Llama 2? Is there a way to compare models, e.g. what is Llama 7B equivalent to in OpenAI land? Perplexity scores?

Also, does ChatGPT use GPT-4 under the hood or 3.5?


LLaMA 2 seems to compete with ChatGPT 3.5, which is great. It's nowhere near as large as GPT-4, so I would not expect it to be competitive with that.

GPT-4-level models that regular people can run on a reasonable hardware budget are going to require innovations in optimization and model efficiency beyond just quantizing weights. Rumor has it that GPT-4 is a "committee" of ~220B-parameter models, which would require ~128 GiB of VRAM at 4-bit quantization to run each model.
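
The ~128 GiB figure follows from simple weight-size arithmetic (this ignores KV cache and activations, which add overhead on top):

    # Approximate weight memory = parameter count * bytes per parameter
    def weight_gib(params: float, bits_per_param: float) -> float:
        return params * bits_per_param / 8 / 2**30

    print(f"220B @ 4-bit : {weight_gib(220e9, 4):.0f} GiB")   # ~102 GiB, weights alone
    print(f" 65B @ 16-bit: {weight_gib(65e9, 16):.0f} GiB")   # ~121 GiB, cf. the ~128 GB
                                                              # RAM figure mentioned earlier
    # Runtime overhead (KV cache, activations, buffers) pushes the 4-bit
    # 220B case toward the ~128 GiB mark.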


> and is trivial to install

The average person/student can't install a Llama 2 model (70 billion parameters) that can compete with ChatGPT-4 (the paid version), considering most students have notebooks and lack top-tier GPUs. Most would have to go to a "free" cloud provider.

ChatGPT-3.5 is also free and compares favorably against the lower-parameter Llama 2 models. But Llama 1, lower-parameter/heavily quantized Llama 2, and ChatGPT-3.5 are not practically comparable to ChatGPT-4 for most tasks [1]. I haven't found a use for those in this sort of context, except for increasing my blood pressure; the time wasted is my justification for paying $0.70/day.

[1] https://www.promptengineering.org/how-does-llama-2-compare-t....


Vicuna looks pretty good. But as mentioned, commercial use isn't possible.

Why do you think that Llama can be replaced? I mean, it is extremely costly to train that thing. And is there even a clean open-source data set for the task?

PS: I wouldn't be surprised if Meta, OpenAI, or Google trains something for a billion dollars or more in compute costs.


These are the same Pythia- and Llama-based models, right?

If so, they certainly aren't at ChatGPT's level of quality. Impressive, potentially useful, but not ChatGPT.

Still, it's an incredible effort; the RLHF data here might eventually make an open-source ChatGPT possible, but these models are not that.


Llama 2 wasn't trained on ChatGPT/GPT-4 output. I think maybe you are thinking of the Vicuna models?

https://lmsys.org/blog/2023-03-30-vicuna/

