In my testing, the chat and instruct-tuned versions of MPT-30B are very close to GPT-3.5 for many tasks, but of course the team that made it got bought up immediately and it's licensed only for non-commercial use. I'm hoping the open source community runs with the base model the same way they did with LLaMA.
Going purely by the benchmarks from the OP - you can essentially consider MPT equivalent to LLaMA. It might be better or worse depending on the specific task, but not by much.
So compared to GPT-3.5 - it's not great at all. That said, LLaMA showed significant improvements via fine-tuning and I expect those to apply here as well.
EDIT: Oh, I forgot this is 7B. I personally haven't spent much time with 7B LLaMA because my hardware can do 13B/30B - and honestly 13B LLaMA is very noticeably better, to the point where if you can run it you shouldn't bother with 7B. So this really can't compare to GPT-3.5 without finetuning, and even then it'll be behind (based on the LLaMA models).
Yeah - it's on par with, if not better than, GPT-3.5! This is the base model (13B) with no fine-tuning or censorship.
Feel free to give it a spin for code as well! We're just the infrastructure layer here so we don't use any data for retraining these models. LLaMA 2 70B coming soon! :D
LLaMA itself performs comparably to GPT-3.5 (at least the 30/65B models), but the RLHF of ChatGPT is much better than what the community has produced thus far, and it's tuned to work well without tinkering. There will be open source models with that level of fine-tuning in the near future, at which point GPT-4 will mainly be superior for stuff like code that needs the best possible cohesion and accuracy.
It's interesting that they appear to have undertrained their 30B model, at least compared to LLaMA/Falcon.
Coding ability fared better, but it's still far behind WizardCoder, which is half the size - of course, WizardCoder hadn't been released when they started training MPT-30B.
The 8k context is an interesting addition. Are there any standard benchmarks to show how coherently models perform at different context lengths - 1k, 2k, 4k, 8k, etc?
Isn't the chat version of Llama 2 trained on GPT-4 output, hence its non-commercial license (as opposed to the base model), or am I just making things up?
This assessment is based largely on GPT-4 evaluation of the output. In actual use, Vicuna-13B isn't even as good as GPT-3.5, although I do have high hopes for 30B if and when they decide to make that available (or someone else trains it, since the dataset is out).
And don't forget that all the LLaMA-based models only have 2K context size. It's good enough for random chat, but you quickly bump into it for any sort of complicated task solving or writing code. Increasing this to 4K - like GPT-3.5 has - would require significantly more RAM for the same model size.
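To put rough numbers on that - a back-of-the-envelope sketch using the published LLaMA-13B shape (40 layers, hidden size 5120) and an fp16 KV cache; the helper function itself is just illustrative:

```python
# Back-of-the-envelope KV-cache size for a LLaMA-13B-shaped model
# (assumed: 40 layers, hidden size 5120, fp16 cache); illustrative only.
def kv_cache_bytes(seq_len, n_layers=40, hidden=5120, bytes_per_elem=2):
    # 2x for the key and value tensors, one pair per layer, per token
    return 2 * n_layers * hidden * bytes_per_elem * seq_len

for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV cache")
```

So going from 2K to 4K roughly doubles the cache from ~1.6 GiB to ~3.1 GiB per sequence, on top of the weights themselves.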
I have been playing with all the local LLaMA models, and in my experience, the gains that are touted are often very misleading (e.g. people claiming that 13B can be as good as ChatGPT-3.5; it is absolutely not) and/or refer to synthetic testing that doesn't seem to translate well to actual use. Using GPT to generate training data for fine-tuning seems to produce the best results, but even so, GPT4-x-Alpaca 30B is still clearly inferior to the real thing. In general, the gap between 13B and 30B for any LLaMA-derived model is pretty big, and I've yet to see any fine-tuned model at 13B work better than plain llama-30b in actual use.
So I think that 65B may be a realistic estimate here assuming that OpenAI does indeed have some secret sauce for training that's substantially better, but below that I'm very skeptical (but still hope I'm wrong - I'd love to have GPT-3.5 level of performance running locally!).
The jump between LLaMA 13B and 30B is quite significant. And I don't think their instruction finetuning is SOTA, though the point about general knowledge is a good one: instruction-tuned LLaMA lies very confidently.
But one great thing about open source LLMs is that you can specialize them in various tasks with affordable LoRA training, enough to easily beat GPT-4 in a specific niche.
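Not authoritative, but as a rough sketch of what that specialization looks like in practice with the Hugging Face peft library - the base checkpoint, target modules, and hyperparameters below are placeholder assumptions:

```python
# Minimal LoRA fine-tuning setup using Hugging Face peft; base model,
# target modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # placeholder choice of base checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Because only the small adapter matrices get gradients, this fits on a single consumer GPU for the smaller model sizes.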
You can run llama-30b right now on high-end consumer hardware (RTX 3090+) using int4 quantization. With two GPUs, llama-65b is within reach. And even 30b is surprisingly good, although it's clearly not as well trained as ChatGPT specifically for dialog-like task setting.
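For anyone curious, one common route is 4-bit loading through transformers + bitsandbytes - a minimal sketch, assuming a 24 GB card and a placeholder checkpoint name (GPTQ-style loaders are another option):

```python
# Sketch of loading a LLaMA checkpoint in 4-bit on a single 24 GB GPU
# using transformers + bitsandbytes; the checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-30b"  # assumed checkpoint name
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```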
It's not clear from the GitHub; are there any plans to eventually train the 30B or 65B LLaMA models? The 65B model seems comparable to GPT-3.5 for many things, and can run fine on a beefy desktop just on CPU (CPU RAM is much cheaper than GPU RAM). It'd be amazing to have an open source version.
I'm curious if this will give better results than LLaMA 7B? LLaMA 7B felt like a toy that, while cool to be able to run locally, did not feel useful in any way compared to the state of GPT. Here's hoping for better and/or the release of larger models with low performance requirements soon :)
EDIT: my first question times out when run online; seems like Hugging Face is getting hugged to death.
It's fantastic that more orgs are releasing open-source models trained on more than 300B or so tokens. Here's my take from the details I could find.
Pros
- 4096 context width (vs 2048 for llama, gpt-j, etc)
- 3B to 65B released or in progress
- RL tuned models available
- Trained on more tokens than existing non-llama models
- 128 head dim, so can use flash attention (unlike GPT-J)
Cons
- No benchmarks released, or details about the model
- Somewhat restrictive license on the base models, and NC license on the RL models
- Small models only trained on 800B tokens, compared to 1T for llama-7B, and potentially more for other upcoming alternatives (RedPajama, etc). I'd like to see their loss curves to see why they chose 800B.
High-level, this is likely to be more accurate than existing non-llama open source models. It's hard to say without benchmarks (but benchmarks have been gamed by training on benchmark data, so really it's just hard to say).
Some upcoming models in the next few weeks may be more accurate than this, and have less restrictive licenses. But this is a really good option nonetheless.
I've not done any real benchmarking, but the OpenAssistant fine-tuning from LAION has been done on it. It worked reasonably well for something local, but it definitely felt nowhere near as complete/advanced as any of the ChatGPT stuff. I imagine this Databricks setup is more complete there, but I personally wouldn't expect too much more than GPT-3-level performance. That said, if this dataset is open (I haven't really looked too much at the article yet), then you could quite easily use it to tune LLaMA just like the Stanford Alpaca models, which might be a better combo. Though that wouldn't be licensed for commercial use, given the underlying license.
I'm playing with the Llama 3 8B instruct model out of curiosity, and it is insanely better than Llama 2 in that regard. It's almost like a fully uncensored model. It did refuse to make pentest scripts when I asked, which is fine, but it made the scripts after I changed the system prompt to something more 'permissive'. The model seems to adhere more to user commands, and it's more useful overall. It's even good at complex math, which is insane considering even GPT-4 is bad at it.
I wasn't sure if Meta would release the model to the public; I'm glad they did.
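For what it's worth, here's a minimal sketch of the system-prompt steering I mean, using transformers' chat template; the model id and prompt text are just placeholders:

```python
# Sketch of steering an instruct model via the system message using
# transformers' chat template; model id and prompts are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a security researcher's assistant; "
                                  "answer technical questions directly."},
    {"role": "user", "content": "Write a short port-scanning script in Python."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # formatted with the model's chat tokens, ready for generate()
```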
LLaMA2 seems to compete with ChatGPT 3.5, which is great. It's nowhere near as large as GPT-4 so I would not expect it to be competitive with that.
GPT-4 level models that regular people can run with a reasonable hardware budget are going to require innovations in optimization and model efficiency beyond just quantizing weights. Rumor has it that GPT-4 is a "committee" of ~220B-parameter models, which would require ~128 GiB of VRAM at 4-bit quantization to run each one.
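The rough arithmetic behind that figure, taking the rumor at face value (the overhead estimate is a loose assumption):

```python
# Rough arithmetic behind the ~128 GiB figure, taking the rumored 220B
# parameter count at face value; overhead estimate is a loose assumption.
params = 220e9
weight_bytes = params * 0.5          # 4-bit weights -> 0.5 bytes per parameter
weight_gib = weight_bytes / 2**30    # ~102 GiB just for the weights
print(f"weights alone: {weight_gib:.0f} GiB")
# Add KV cache, activations, and quantization/runtime overhead and you land
# in the ~128 GiB ballpark per model instance.
```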
Llama 70B is at GPT-3.5's level! It's not a perfect comparison, but I built this site to compare the two and Llama 70B won (https://llmboxing.com/leaderboard).