10x the parameters? Maybe not in a single model, but 10x as many expert models might have 10x the value. I'm sure there are diminishing returns eventually, but we're probably not close to that point.
I don't care how many parameters my model has per se. What I care about is how expensive it is to train in time and dollars. If this makes it cheaper to train better models despite more parameters, that's still a win.
I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.
That's a complicated question to answer. What I'd say is that more parameters make the model more robust, but there are diminishing returns. Optimizations are under way.
That's not sufficient. If you build 10 different models, each with tens of thousands of parameters, until you get the results you want, it doesn't matter how accurately the model seems to predict past data. Modelling is a tricky business that easily falls prey to such "data snooping" methods.
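A toy sketch of what I mean (all the data and "models" here are made up, just to show the selection effect):

    # "Data snooping" in miniature: try many arbitrary models on the same
    # historical data, keep the best-looking one, then watch it fail on new data.
    import numpy as np

    rng = np.random.default_rng(12345)
    past = rng.normal(size=500)     # the "historical" series we pretend to predict
    future = rng.normal(size=500)   # genuinely new data the winner never saw

    def random_model(seed):
        # a "model" that is really just noise with its own seed
        return np.random.default_rng(seed).normal(size=500)

    candidates = [random_model(s) for s in range(1000)]
    scores = [abs(np.corrcoef(c, past)[0, 1]) for c in candidates]
    best = candidates[int(np.argmax(scores))]

    print("best in-sample |corr|:  ", max(scores))                             # inflated purely by selection
    print("same model out of sample:", abs(np.corrcoef(best, future)[0, 1]))   # ~0

Pick the best of enough arbitrary models and it will always look like it "predicted the past".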
Isn't there more and more research coming out showing that at a certain point (~200B), parameters have significantly decreasing returns, and it's better to then do some supervised learning on top of the base model?
It's not reasonably possible (currently?) to get the same performance from a 7-billion-parameter model as from a 175-billion-parameter model with just an additional 6,000 lines of fine-tuning data.
Doubtful, for purely information-theoretic and memory-capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in the long tail, where metrics never go.
Yes and no. We don't need an insane amount of data to make these models accurate. If you have a small set of data that includes the benchmark questions, they'll be "quite accurate" under examination.
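A crude way to see whether that's happening (the corpus and questions below are just placeholders) is to look for long n-gram overlap between the training data and the benchmark:

    # Naive benchmark-leakage check: does any training document share a long
    # n-gram with an eval question? Corpus and questions here are placeholders.
    def ngrams(text, n):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    train_docs = ["... training corpus goes here ..."]                        # placeholder
    benchmark_qs = ["What is the boiling point of water at sea level?"]       # placeholder

    for q in benchmark_qs:
        if any(ngrams(q, 6) & ngrams(doc, 6) for doc in train_docs):
            print("possible contamination:", q)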
The problem is not the amount of data; it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem", which says that a fixed-parameter model can't be good at everything, so making a model smarter at one thing is going to make it dumber at another.
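For reference, the supervised-learning form of the theorem (Wolpert, 1996) says, roughly, that averaged uniformly over all possible target functions, every learner has the same expected off-training-set error:

    \mathbb{E}_{f \sim \mathrm{Uniform}}\big[\, \mathrm{err}_{\mathrm{OTS}}(A_1 \mid f) \,\big]
      \;=\; \mathbb{E}_{f \sim \mathrm{Uniform}}\big[\, \mathrm{err}_{\mathrm{OTS}}(A_2 \mid f) \,\big]
    \qquad \text{for any two learners } A_1, A_2

So, at least when you average over all conceivable tasks, gains in one place have to be paid for somewhere else.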
We'd be much better off training smaller models for specific domains and training an agent that can use them as tools, DeepMind-style.
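Something like this, very roughly (the router and the domain models are purely hypothetical stand-ins):

    # Hypothetical "router + small domain experts" setup: a cheap classifier
    # decides which specialist model or tool handles each query.
    from typing import Callable, Dict

    def code_model(q: str) -> str:    return f"[small code model] {q}"
    def calculator(q: str) -> str:    return f"[calculator tool] {q}"
    def general_model(q: str) -> str: return f"[small general model] {q}"

    ROUTES: Dict[str, Callable[[str], str]] = {
        "code": code_model, "math": calculator, "general": general_model,
    }

    def route(query: str) -> str:
        # naive keyword router; in practice this would be a learned classifier
        if any(k in query.lower() for k in ("def ", "compile", "stack trace")):
            return "code"
        if any(ch.isdigit() for ch in query):
            return "math"
        return "general"

    def agent(query: str) -> str:
        return ROUTES[route(query)](query)

    print(agent("What is 17 * 23?"))   # dispatched to the calculator tool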
The algorithms probably aren't that great, and more of them would likely have diminishing returns. Adding substantially more false positives could actually be a bad thing.
As someone else mentioned, the 35k parameters makes me skeptical. Taleb, Tversky, and Kahneman have presented good evidence that most algorithms do better with fewer parameters. The more parameters, the more noise.
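The classic toy version of this (numbers purely illustrative) is fitting the same noisy data with polynomials of increasing degree:

    # More parameters fitting the noise: a degree-15 polynomial vs a cubic
    # on the same 20 noisy samples of a sine curve.
    import numpy as np
    from numpy.polynomial import Polynomial

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 1, 20)
    x_test = np.linspace(0, 1, 200)
    truth = lambda x: np.sin(2 * np.pi * x)
    y_train = truth(x_train) + rng.normal(scale=0.3, size=x_train.size)

    for degree in (3, 15):
        fit = Polynomial.fit(x_train, y_train, deg=degree)   # least-squares polynomial fit
        test_mse = np.mean((fit(x_test) - truth(x_test)) ** 2)
        print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")

    # On most seeds the degree-15 fit has much higher test error than the cubic:
    # the extra parameters mostly end up modelling the noise.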