
> is supposed to address that precise limitation, but the results show that it just doesn't do very well at all.

They demonstrate an improvement from the previous 5% to 7%.




> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation of the scaling - 7B is pretty big.


> A nitpick, perhaps, but isn't that three orders of magnitude?

Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)


> The title is completely accurate.

Accurate, but a half truth. The prize was for a 10% improvement, but before that solution was produced they had already improved by 8.4%. The headline makes it sound like the improvement from zero to 10% was not worth the engineering cost, but really it was the improvement from 8.4% to 10% which cost too much.
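
Rough arithmetic, using only the numbers in the comment above:

    # Marginal vs. headline improvement (numbers from the comment, not the original source).
    total_required   = 10.0   # % improvement the prize demanded
    already_achieved = 8.4    # % improvement reached before the winning solution
    final_step = total_required - already_achieved
    print(final_step)                      # 1.6 percentage points
    print(final_step / total_required)     # 0.16 -> the costly work bought ~16% of the headline gain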


>Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

I wasn't able to see the numba comparison. Anyone know how much worse it was?


>or corrects 30% in a few days,

Didn't it just do that?


I don't think it holds up to much scrutiny. I think it's mixing up "less improvement than expected for the effort" with "actually measurably worse".

The author states that widening the M-25 only achieved a 10% increase in throughput. This is not good, and it is the definition of diminishing returns.

That's not what the numbers say. It's on the order of low dozens of percent improvement, not hundreds of percent.

> [ABCDE] has a horrible sensitivity (50%-90% depending on the type). But if this technique also only has a 50% sensitivity...

If the sensitivities of the two techniques are not correlated, a 50% improvement on top of ABCDE would be fantastic.
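
A minimal sketch of that reasoning, assuming the two tests miss cases independently (an idealization; real tests are rarely fully uncorrelated):

    # Combined sensitivity of two tests that each catch 50% of true positives,
    # assuming their misses are statistically independent.
    s_abcde = 0.5
    s_new   = 0.5
    combined = 1 - (1 - s_abcde) * (1 - s_new)
    print(combined)   # 0.75 -> 75% of cases caught by at least one of the two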


It's only a 16% range improvement in 3 years.

Nice claims currently, but we'll see which ones are correct/feasible.


I was replying to a comment that said it “seems fine.”

It does not seem fine.

It is incomprehensible and doesn’t match the results I’ve seen from 7B through 65B.

It is true that RLHF could improve it, and perhaps then this severe an optimization will seem fine.


> Sometimes the guess is right on the money and it's extremely effective, other years not so much.

Oh, that's why it seems to work some years and not others. TIL.


It's an improvement (IIRC, it makes the error growth linear), but it does not 'fix' the problem.

Agreed. Getting results which don't include the quoted terms is mind bogglingly useless.

I thought the whole point was to improve over time, not get worse. :(


> we'd expect little difference

I think that also likely holds for the quants; the difference could very well be within the error bars.

Anyway, it's been posted to r/locallama so I'm sure someone will try it within the hour and report back soon :P


> 10% was an example to simplify the model.

It also just so happened to quintuple his returns. Not exactly comparing apples to apples when you have a 4% (realistic) vs. 10% (unrealistic) return built into the model.
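
A quick check of the "quintuple" claim; the ~30-year horizon below is my assumption, since the comment doesn't state one:

    # Compounding a 10% vs. a 4% annual return over an assumed 30-year horizon.
    years = 30
    optimistic = 1.10 ** years    # ~17.4x the starting amount
    realistic  = 1.04 ** years    # ~3.2x the starting amount
    print(optimistic / realistic) # ~5.4 -> roughly quintuple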


> that's a pretty big correction.

No, it isn't. It doesn't affect the core finding at all.


> but in the grand scheme of things, 1% absolute improvement may not be such a game-changer, especially if it comes at the cost of other relevant metrics like model complexity, developer sanity or performance

fasttext makes errors about 10% of the time, and our approach makes errors about 5% of the time. It's certainly fair to say (although nitpicky) that "accuracy" isn't quite the right term here (I should have said "half the error").

But as for your general sigh/rant... absolute improvement is very rarely the interesting measure. Relative improvement tells you how much your existing systems will change. So if your error goes from 5% to 4%, then you have 20% fewer errors to deal with than you used to.
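
In code, the 5% -> 4% example works out like this:

    # Absolute vs. relative improvement for the 5% -> 4% error example.
    old_error, new_error = 0.05, 0.04
    absolute_gain = old_error - new_error       # 0.01, i.e. "only 1%"
    relative_gain = absolute_gain / old_error   # 0.20, i.e. 20% fewer errors to deal with
    print(absolute_gain, relative_gain)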

An interesting example: the Kaggle Carvana segmentation competition had a lot of competitors complaining that the simple baseline models were so accurate that the competition was pointless (it was very easy to get 99% accuracy). The competition administrator explained, however, that the purpose of the segmentation model was to do automatic image pasting into new backgrounds, where every mis-classified pixel would lead to visible image problems (and with a million+ pixels per image, even a 1% error rate leaves a lot of bad pixels!)
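
Rough pixel arithmetic behind that point (the one-megapixel figure is just for illustration; actual image sizes vary):

    # Why 99% per-pixel accuracy is not enough for automatic background replacement.
    pixels   = 1_000_000        # "a million+ pixels" per image
    accuracy = 0.99
    bad_pixels = pixels * (1 - accuracy)
    print(bad_pixels)           # 10,000 mis-classified pixels per image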


> we examine the contribution of more computing power to better outcomes

No, they pick a set of problems where computational methods are known to have a beneficial impact and then plot progress in each field against increased amounts of computing. Since the amount of computing power used rose monotonically and Elo score/Go performance/weather prediction success also trends monotonically, the correlation is bound to be high. However, computing power is not the only thing that rose mostly monotonically during that time. At best they derived an upper bound on the contribution of more computing power to better outcomes.

For example, in Mixed Integer Linear Programming, studies were done to measure algorithmic vs. hardware speedup: "On average, we found out that for solving LP/MILP, computer hardware got about 20 times faster, and the algorithms improved by a factor of about nine for LP and around 50 for MILP, which gives a total speed-up of about 180 and 1,000 times, respectively." https://arxiv.org/abs/2206.09787 This methodology would attribute the full 1,000-times effect to the increase in FLOPs alone.
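
The decomposition in that quote, spelled out (numbers taken from the paper's summary as quoted above):

    # Hardware vs. algorithmic contributions to the total LP/MILP speedup.
    hardware  = 20          # hardware got ~20x faster
    algo_lp   = 9           # algorithmic speedup for LP
    algo_milp = 50          # algorithmic speedup for MILP
    print(hardware * algo_lp)    # ~180x total for LP
    print(hardware * algo_milp)  # ~1000x total for MILP
    # A compute-vs-outcome plot that ignores the algorithmic factor would credit
    # the full ~1000x to the increase in FLOPs alone.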

And just a methodological concern: taking the logarithm of one axis is a non-linear transformation, and doing a linear fit afterwards gives a distorted measure of the distance between fit and data, depending on the data. This effect was not discussed. It does mess with the R value, so I would not feel comfortable using that R value to derive an attribution.
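
A toy sketch of that concern (synthetic data, purely illustrative; here the fit is done in log-log space and the R^2 is compared between transformed and raw coordinates):

    # Least-squares fitting after a log transform minimizes multiplicative residuals,
    # so the R^2 computed in log space is not the R^2 of the same curve against raw data.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 100, 50)
    y = 2.0 * x ** 1.5 * rng.lognormal(0.0, 0.3, size=x.size)   # noisy power law

    logx, logy = np.log(x), np.log(y)
    slope, intercept = np.polyfit(logx, logy, 1)                # linear fit in log-log space
    pred_log = slope * logx + intercept

    def r_squared(actual, predicted):
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        return 1 - ss_res / ss_tot

    print(r_squared(logy, pred_log))        # R^2 in the transformed (log) space
    print(r_squared(y, np.exp(pred_log)))   # R^2 of the same fit against the raw data -- different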

