
> is supposed to address that precise limitation, but the results show that it just doesn't do very well at all.

They demonstrate an improvement from the previous 5% to 7%.




> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation of the scaling - 7B is pretty big.


> A nitpick, perhaps, but isn't that three orders of magnitude?

Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)


> The title is completely accurate.

Accurate, but a half truth. The prize was for a 10% improvement, but before that solution was produced they had already improved by 8.4%. The headline makes it sound like the improvement from zero to 10% was not worth the engineering cost, but really it was the improvement from 8.4% to 10% which cost too much.
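
Rough arithmetic, using only the numbers in the comment above:

    # Marginal vs. headline improvement (numbers from the comment, not the original source).
    total_required   = 10.0   # % improvement the prize demanded
    already_achieved = 8.4    # % improvement reached before the winning solution
    final_step = total_required - already_achieved
    print(final_step)                      # 1.6 percentage points
    print(final_step / total_required)     # 0.16 -> the costly work bought ~16% of the headline gain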


>Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

I wasn't able to see the numba comparison. Anyone know how much worse it was?


>or corrects 30% in a few days,

Didn't it just do that?


I don't think it holds up to much scrutiny. I think it's mixing up "less improvement than expected for the effort" with "actually measurably worse".

The author states that widening the M-25 only achieved a 10% increase in throughput. This is not good, and it is the definition of diminishing returns.

That's not what the numbers say. It's on the order of low dozens of percent improvement, not hundreds of percent.

> [ABCDE] has a horrible sensitivity (50%-90% depending on the type). But if this technique also only has a 50% sensitivity...

If the sensitivities of the two techniques are not correlated, a 50% improvement on top of ABCDE would be fantastic.
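
A minimal sketch of that reasoning, assuming the two tests miss cases independently (an idealization; real tests are rarely fully uncorrelated):

    # Combined sensitivity of two tests that each catch 50% of true positives,
    # assuming their misses are statistically independent.
    s_abcde = 0.5
    s_new   = 0.5
    combined = 1 - (1 - s_abcde) * (1 - s_new)
    print(combined)   # 0.75 -> 75% of cases caught by at least one of the two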


It's only a 16% range improvement in 3 years.

Nice claims currently, but we'll see which ones are correct/feasible.


I was replying to a comment that said it “seems fine.”

It does not seem fine.

It is incomprehensible and doesn’t match the results I’ve seen from 7B through 65B.

It is true that RLHF could improve it, and perhaps then this severe an optimization will seem fine.


> Sometimes the guess is right on the money and it's extremely effective, other years not so much.

Oh, that's why it seems to work some years and not others. TIL.


It's an improvement (IIRC, it makes the error growth linear), but it does not 'fix' the problem.

Agreed. Getting results which don't include the quoted terms is mind bogglingly useless.

I thought the whole point was to improve over time, not get worse. :(


> we'd expect little difference

I think that also likely holds for the quants; the difference could very well be within the error bars.

Anyway, it's been posted to r/locallama so I'm sure someone will try it within the hour and report back soon :P


> 10% was an example to simplify the model.

It also just so happened to quintuple his returns. Not exactly comparing apples to apples when you have a 4% (realistic) vs. 10% (unrealistic) return built into the model.
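
A quick check of the "quintuple" claim; the ~30-year horizon below is my assumption, since the comment doesn't state one:

    # Compounding a 10% vs. a 4% annual return over an assumed 30-year horizon.
    years = 30
    optimistic = 1.10 ** years    # ~17.4x the starting amount
    realistic  = 1.04 ** years    # ~3.2x the starting amount
    print(optimistic / realistic) # ~5.4 -> roughly quintuple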


> that's a pretty big correction.

No, it isn't. It doesn't affect the core finding at all.


> but in the grand scheme of things, 1% absolute improvement may not be such a game-changer, especially if it comes at the cost of other relevant metrics like model complexity, developer sanity or performance

fasttext makes errors about 10% of the time, and our approach makes errors about 5% of the time. It's certainly fair to say (although nitpicky) that "accuracy" isn't quite the right term here (I should have said "half the error").

But as for your general sigh/rant... absolute improvement is very rarely the interesting measure. Relative improvement tells you how much your existing systems will change. So if your error goes from 5% to 4%, then you have 20% fewer errors to deal with than you used to.
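
In code, the 5% -> 4% example works out like this:

    # Absolute vs. relative improvement for the 5% -> 4% error example.
    old_error, new_error = 0.05, 0.04
    absolute_gain = old_error - new_error       # 0.01, i.e. "only 1%"
    relative_gain = absolute_gain / old_error   # 0.20, i.e. 20% fewer errors to deal with
    print(absolute_gain, relative_gain)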

An interesting example: the Kaggle Carvana segmentation competition had a lot of competitors complaining that the simple baseline models were so accurate that the competition was pointless (it was very easy to get 99% accuracy). The competition administrator explained, however, that the purpose of the segmentation model was to do automatic image pasting into new backgrounds, where every mis-classified pixel would lead to visible image problems (and with a million+ pixels per image, even a 1% error rate leaves a lot of bad pixels!)
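
Rough pixel arithmetic behind that point (the one-megapixel figure is just for illustration; actual image sizes vary):

    # Why 99% per-pixel accuracy is not enough for automatic background replacement.
    pixels   = 1_000_000        # "a million+ pixels" per image
    accuracy = 0.99
    bad_pixels = pixels * (1 - accuracy)
    print(bad_pixels)           # 10,000 mis-classified pixels per image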


> we examine the contribution of more computing power to better outcomes

No, they pick a set of problems where computational methods are known to have a beneficial impact and then plot progress in each field against increased amounts of computing. Since the amount of computing power used rose monotonically and Elo score/Go performance/weather prediction success also trends monotonically, the correlation is bound to be high. However, computing power is not the only thing that rose mostly monotonically during that time. At best they derived an upper bound on the contribution of more computing power to better outcomes.

For example, in Mixed Integer Linear Programming, studies were done to measure algorithmic vs. hardware speedup: "On average, we found out that for solving LP/MILP, computer hardware got about 20 times faster, and the algorithms improved by a factor of about nine for LP and around 50 for MILP, which gives a total speed-up of about 180 and 1,000 times, respectively." https://arxiv.org/abs/2206.09787 This methodology would attribute the full 1,000-times effect to the increase in FLOPs alone.
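
The decomposition in that quote, spelled out (numbers taken from the paper's summary as quoted above):

    # Hardware vs. algorithmic contributions to the total LP/MILP speedup.
    hardware  = 20          # hardware got ~20x faster
    algo_lp   = 9           # algorithmic speedup for LP
    algo_milp = 50          # algorithmic speedup for MILP
    print(hardware * algo_lp)    # ~180x total for LP
    print(hardware * algo_milp)  # ~1000x total for MILP
    # A compute-vs-outcome plot that ignores the algorithmic factor would credit
    # the full ~1000x to the increase in FLOPs alone.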

And just a methodological concern: taking the logarithm of one axis is a non-linear transformation, and doing a linear fit afterwards gives a distorted measure of the distance between fit and data, depending on the data. This effect was not discussed. It does mess with the R value, so I would not feel comfortable using that R value to derive an attribution.
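
A toy sketch of that concern (synthetic data, purely illustrative; here the fit is done in log-log space and the R^2 is compared between transformed and raw coordinates):

    # Least-squares fitting after a log transform minimizes multiplicative residuals,
    # so the R^2 computed in log space is not the R^2 of the same curve against raw data.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 100, 50)
    y = 2.0 * x ** 1.5 * rng.lognormal(0.0, 0.3, size=x.size)   # noisy power law

    logx, logy = np.log(x), np.log(y)
    slope, intercept = np.polyfit(logx, logy, 1)                # linear fit in log-log space
    pred_log = slope * logx + intercept

    def r_squared(actual, predicted):
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        return 1 - ss_res / ss_tot

    print(r_squared(logy, pred_log))        # R^2 in the transformed (log) space
    print(r_squared(y, np.exp(pred_log)))   # R^2 of the same fit against the raw data -- different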

