Could you explain what you mean by "Chinchilla under-trained" or "Chinchilla over-trained"? I assume it refers to some measure of trained-ness, but Googling yielded nothing relevant.
My memory says that there’s a “Chinchilla” paper showing how to train the best model for a given compute budget. There’s a trade-off between the amount of training data and the size of the model itself. “Chinchilla under-trained” would mean the model is too big for the amount of training data used; Llama is Chinchilla over-trained in that there’s a ton of data relative to the small size of the model.
Note that this is still desirable for inference, because you want as much training as possible packed into whatever model you can actually fit in your memory.
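To make the trade-off concrete, here's a rough sketch (my own back-of-the-envelope, not code from any of the papers) using two common approximations: training compute of roughly 6*N*D FLOPs for N parameters and D tokens, and the Chinchilla result of roughly 20 training tokens per parameter at the compute-optimal point:

    # Roughly split a fixed training budget between model size and data,
    # Chinchilla-style. Ballpark rules: C ~= 6*N*D training FLOPs, and
    # ~20 tokens per parameter at the compute-optimal point.
    def compute_optimal_split(flops_budget, tokens_per_param=20.0):
        # C = 6*N*D with D = k*N  =>  N = sqrt(C / (6*k))
        params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
        return params, tokens_per_param * params

    budget = 5.76e23  # roughly Gopher's training compute, the budget the paper compares against
    n, d = compute_optimal_split(budget)
    print(f"~{n/1e9:.0f}B params on ~{d/1e12:.1f}T tokens")  # ~69B on ~1.4T

Plugging in roughly Gopher's budget spits out something very close to Chinchilla itself (70B parameters on 1.4T tokens), which is reassuring for a two-line approximation.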
Like the sibling comment said - the proportion of training tokens to parameter count is very important, and there's a certain threshold that needs to be met for it to be "fully trained".
Usually you have a fixed amount of compute (budget/time essentially) - and in that case you want to pick the largest parameter count that you can fully train, and not the largest parameter count your hardware can support and then train that for less time.
tl;dr - Small models trained past the Chinchilla threshold can outperform large models that are undertrained
EDIT: Figure 2 (page 5) and Table 3 (page 8) might be worth checking out.
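To put numbers on the tl;dr, here's a toy comparison at a fixed training budget. It assumes the parametric loss fit from the paper, L(N, D) = E + A/N^alpha + B/D^beta, with constants close to the reported ones (treat them as ballpark, not gospel):

    # Same training compute, two model sizes: a Gopher-sized model on few
    # tokens vs a Chinchilla-sized model on many. Constants are close to the
    # paper's fitted values for L(N, D) = E + A/N**a + B/D**b.
    E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(params, tokens):
        return E + A / params**ALPHA + B / tokens**BETA

    budget = 5.76e23                      # training FLOPs, with C ~= 6*N*D
    for params in (280e9, 70e9):          # Gopher-sized vs Chinchilla-sized
        tokens = budget / (6.0 * params)  # tokens you can afford at this size
        print(f"{params/1e9:.0f}B params, {tokens/1e12:.2f}T tokens "
              f"-> loss ~{loss(params, tokens):.2f}")

The smaller model wins on predicted loss (~1.94 vs ~1.98 here) because what it gives up on the parameter term it more than makes back on the data term - which is essentially the Gopher-vs-Chinchilla comparison in the paper.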
DeepMind put out the Chinchilla paper last year, showing that GPT-3 and others could have gotten better at the same size by just shoving more training tokens at them. The paper showed some snazzy curves where more training time and data equalled better quality, and speculated / demonstrated that a lot more training tokens and time could get better quality out of smaller models than GPT-3's 175B.
This was, for a minute, ignored, because the PaLM paper came out very shortly thereafter and seemed to show, pretty conclusively, that there are unusual and exciting emergent behaviours coming out of much larger models (PaLM is 540B parameters), and so that was the hotter news.
In the meantime, some really smart folks looked at the Chinchilla curves and were like "hmm. One way to think about this is that if you are willing to put a LOT more compute in up front on a model, the inference costs go down, as some sub-linear function of that extra compute."
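Rough numbers behind that quote (a made-up pairing of model sizes, using the usual approximations of ~6*N*D FLOPs to train and ~2*N FLOPs per generated token at inference):

    # Two hypothetical models with the same training budget: a big one on
    # relatively few tokens vs a small one trained far past its "optimal"
    # token count. Training ~= 6*N*D FLOPs, inference ~= 2*N FLOPs per token.
    def train_flops(params, tokens):
        return 6.0 * params * tokens

    small = dict(params=13e9, tokens=1.0e12)
    large = dict(params=65e9, tokens=0.2e12)

    print(f"training: {train_flops(**small):.1e} vs {train_flops(**large):.1e} FLOPs")
    print(f"inference: {2*small['params']:.1e} vs {2*large['params']:.1e} FLOPs per token")

Same training bill (~7.8e22 FLOPs each), but the small model is 5x cheaper per generated token forever after - and that saving compounds if a huge number of people end up running it.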
The instinct behind Llama is that if you're going to give away a model, and it's going to get run on the edge, it might make sense to spend a whole, whole lot of compute, once, training something past what the paper considered optimal, and well into the region the paper would call "not worth it", precisely because the entire world might be able to run it if you can get something good and much smaller.
So, to sum up: OPT and LLMs from that era are significantly 'under-trained' compared even to GPT-3, which is itself under-trained by something like an order of magnitude from where the Chinchilla paper implies it should be.
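For a rough sense of the gap, using the ~20 tokens per parameter rule of thumb (the token counts below are the commonly cited figures, so treat the ratios as approximate):

    # How far short of ~20 tokens/parameter these 175B runs fell.
    models = {
        "GPT-3 175B": dict(params=175e9, tokens=300e9),
        "OPT-175B":   dict(params=175e9, tokens=180e9),
    }
    for name, m in models.items():
        target = 20 * m["params"]   # ~3.5T tokens for a 175B-parameter model
        print(f"{name}: {m['tokens']/1e9:.0f}B tokens vs ~{target/1e12:.1f}T "
              f"suggested ({target/m['tokens']:.0f}x short)")

That's roughly 12x short for GPT-3 and 19x for OPT, which is where the "order of magnitude" comes from.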
I guess I made up the phrases over- and under-trained; there might be some other way to talk about it elsewhere. Sorry! :)