
10x the parameters? Maybe not in a single model, but maybe 10x the expert models have 10x the value. I'm sure there are diminishing returns eventually, but we're probably not close to that.
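For readers unfamiliar with how mixture-of-experts models split work across experts, here is a minimal, illustrative PyTorch sketch with top-1 routing. The layer sizes and expert count are arbitrary placeholders, not any particular model's configuration.

    # Minimal sketch of a mixture-of-experts layer with top-1 routing.
    # Sizes and expert count are arbitrary, for illustration only.
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)   # router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.ReLU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                            # x: (batch, d_model)
            scores = self.gate(x).softmax(dim=-1)        # routing probabilities
            top_p, top_i = scores.max(dim=-1)            # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = top_i == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
            return out

    x = torch.randn(4, 64)
    print(TinyMoE()(x).shape)   # torch.Size([4, 64])

Each token only pays for one expert's forward pass, which is why "10x the experts" is not the same cost as "10x the parameters" in a dense model.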



When will we reach an upper limit/diminishing returns in terms of number of parameters and mixture of experts?

I don't care how many parameters my model has per se. What I care about is how expensive it is to train in time and dollars. If this makes it cheaper to train better models despite more parameters, that's still a win.

I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.

That's a complicated question to answer. What I'd say is that more parameters make the model more robust, but there are diminishing returns. Optimizations are under way.

That's not sufficient. If you write 10 different models, each with tens of thousands of parameters, until you get the results you want, it doesn't matter how accurately the model seems to predict the past. Modelling is a tricky business that easily falls prey to such "data snooping" methods.
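A toy numpy sketch of that data-snooping effect: none of the "models" below contain any signal, yet the one selected for scoring best on the data still looks usefully accurate on that same data. All numbers are arbitrary.

    # Toy illustration of data snooping: generate many random "models" and
    # keep the one that happens to score best on the same data.
    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)             # binary outcomes, pure coin flips

    best_acc = 0.0
    for _ in range(1000):                        # "write 1000 different models"
        preds = rng.integers(0, 2, size=200)     # each model guesses at random
        best_acc = max(best_acc, (preds == y).mean())

    print(f"best in-sample accuracy: {best_acc:.2f}")   # roughly 0.60, despite zero signal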

Isn't there more and more research coming out showing that past a certain point (~200B), parameters have significantly diminishing returns and it's better to just do some supervised learning on top of the base model?

Parameters don't have diminishing returns so much as we don't have enough (distinct) data to train models to use that many parameters efficiently.
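A back-of-envelope take on the data side, using the rough "~20 training tokens per parameter" compute-optimal heuristic associated with the Chinchilla results (a rule of thumb, not a law):

    # Rough training-token requirement under the ~20 tokens/parameter heuristic.
    TOKENS_PER_PARAM = 20   # rule of thumb from the Chinchilla scaling work

    for params in (7e9, 70e9, 700e9):
        tokens = params * TOKENS_PER_PARAM
        print(f"{params/1e9:>5.0f}B params -> ~{tokens/1e12:.1f}T tokens of distinct data")

At the top end that is on the order of ten-plus trillion distinct tokens, which is roughly where the "not enough distinct data" concern comes from.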

In my machine learning experience, if it only takes 10x the parameters to bring a significant improvement, I feel lucky.

If your model has 17 billion parameters, you missed some.

Parameter count seems to only matter for the range of skills, but these smaller models can be tuned to be more than competitive with far larger models.

I suspect the future is going to be owned by lots of smaller, more specific models, possibly trained by much larger models.

These smaller models have the advantage of faster and cheaper inference.


It’s not reasonably possible (currently?) to get the same performance from a 7 billion parameter model as a 175 billion parameter model with just an additional 6000 lines of finetuning data.

Yes, because it turns out that if you have a more reasonable number of parameters but train for longer, the outcome is a more efficient model.

I think there will be diminishing utility of “smarter” models

People won’t need them

They’ll need more efficient models: smaller parameter count, faster output, and just as smart as some current baseline benchmark.


Maybe, but the point is they'll need vastly less data than the smorgasbord of hyper-parameters in use today.

It has more parameters, but not all of them are used during inference. They compared models that use equal numbers of parameters.
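For concreteness, a made-up sparse-MoE configuration showing how total parameter count and per-token active parameter count diverge under top-2 routing (all numbers hypothetical):

    # Illustrative (made-up) sparse-MoE configuration: total parameters vs.
    # the parameters actually activated per token under top-2 routing.
    n_experts      = 8
    active_experts = 2          # top-2 routing
    expert_params  = 5e9        # per expert (hypothetical)
    shared_params  = 3e9        # attention, embeddings, router (hypothetical)

    total  = shared_params + n_experts * expert_params
    active = shared_params + active_experts * expert_params
    print(f"total:  {total/1e9:.0f}B parameters")   # 43B
    print(f"active: {active/1e9:.0f}B per token")   # 13B

Comparing at equal active parameters is the apples-to-apples comparison the comment is describing.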

How much data do you need to mitigate the risk of over fitting a trillion parameter model?

Doubtful, for purely information-theoretic and memory-capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail, where metrics never go.
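A crude bound that makes the capacity point concrete: weights stored at b bits apiece cannot encode more than parameters x b bits, however you train. The 16-bit figure below is just the storage format; usable capacity is lower.

    # Crude information-capacity upper bound per model size.
    BITS_PER_WEIGHT = 16   # fp16/bf16 storage; a loose upper bound on capacity

    for params in (7e9, 70e9, 175e9):
        capacity_gb = params * BITS_PER_WEIGHT / 8 / 1e9
        print(f"{params/1e9:>4.0f}B params -> at most ~{capacity_gb:,.0f} GB of raw capacity")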

Yes and no. We don't need an insane amount of data to make these models accurate: if you have a small set of data that includes the benchmark questions, they'll be "quite accurate" under examination.

The problem is not the amount of data, it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem" that says that a fixed parameter model can't be good at everything, so trying to make a model smarter at one thing is going to make it dumber at another thing.

We'd be much better off training smaller models for specific domains and training an agent that can use tools, DeepMind-style.
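A sketch of that router-plus-specialists idea; the specialist functions and the keyword routing rule are placeholders standing in for small finetuned models, external tools, and a learned router.

    # Sketch of a tool-using router: a cheap routing step picks a small
    # specialist model (or a tool) per request. All names are placeholders.
    from typing import Callable, Dict

    def code_model(q: str) -> str:    return f"[code specialist] {q}"
    def math_tool(q: str) -> str:     return f"[calculator] {q}"
    def general_model(q: str) -> str: return f"[small generalist] {q}"

    ROUTES: Dict[str, Callable[[str], str]] = {
        "code": code_model,
        "math": math_tool,
    }

    def route(query: str) -> str:
        # Stand-in for a small learned router; here just keyword matching.
        for key, handler in ROUTES.items():
            if key in query.lower():
                return handler(query)
        return general_model(query)

    print(route("Write code to reverse a list"))
    print(route("math: what is 17 * 23?"))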


The algorithms probably aren't that great, and more of them would likely have diminishing returns. Adding substantially more false positives could actually be a bad thing.

As someone else mentioned, the 35k parameters figure is suspect. Taleb, Tversky, and Kahneman have good evidence that most algorithms are better with fewer parameters. The more parameters, the more noise.
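A toy illustration of the "more parameters, more noise" point: on a small noisy dataset, a 15-coefficient polynomial fit chases the noise while a 4-coefficient fit generalizes better. Numpy only; the data and seed are arbitrary.

    # Toy illustration: on small, noisy data, a many-parameter fit chases
    # noise while a low-parameter fit generalizes better.
    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 1, 15)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
    x_test  = np.linspace(0, 1, 200)
    y_test  = np.sin(2 * np.pi * x_test)

    for degree in (3, 14):                       # 4 vs. 15 fitted coefficients
        coeffs = np.polyfit(x_train, y_train, degree)
        err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:>2}: test MSE {err:.3f}")   # high degree fares far worse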
