I haven't read it closely, but it looks to me as if the calculation estimates the expected number of counterexamples rather than the "measure" of them (however you've chosen to define that).
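To illustrate the distinction with made-up numbers (I don't know how the calculation actually defines things): if counterexamples occur with density p = 2^-40 over a space of n = 2^64 inputs, their measure is a negligible 2^-40, yet the expected number of counterexamples is n*p = 2^24, roughly 16 million. A count and a measure can point in very different directions.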
They said it "should" go down, but that another comment saying the worst case is the same is "also correct".
I do not see any "complete nonsense" here. I suppose they should have used a different word from "tolerance" for the expected value, but that's pretty nitpicky!
The headline seems to differ from the rest of the article. It seems to suggest that the preprint paper is entirely accurate and that our previous calculation method was incorrect.
Am I interpreting this correctly? Is this just a bad and misleading headline?
> A nitpick, perhaps, but isn't that three orders of magnitude?
Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)
I'm not equipped to evaluate this, but a quick Google search suggests this result goes against the majority of other papers. That's not to say it's wrong, just that reading most other papers on the subject would suggest something else.
> "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up at a local maximum (instead of the best/correct answer), but it works.
We cannot solve the problem with a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.
I'm not saying it's the same for LLMs, but "trying a bunch of different values and seeing which one works best" is something we do a lot.
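As a concrete toy of what I mean (my own sketch, nothing to do with any particular paper; the objective function and sample count are invented for illustration), random search looks roughly like this:

    import random

    def objective(x):
        # Stand-in for an expensive black-box score (say, validation accuracy);
        # we assume there is no closed-form way to find its maximum.
        return -(x - 0.3) ** 2 + 0.01 * random.random()

    best_x, best_score = None, float("-inf")
    for _ in range(100_000):          # "a billion (or whatever)" samples, scaled down
        x = random.uniform(0.0, 1.0)  # sample a candidate value at random
        score = objective(x)
        if score > best_score:        # keep the best value seen so far
            best_x, best_score = x, score

    print(best_x, best_score)

With enough samples it lands near the true optimum around 0.3; with noisier or multimodal objectives it may settle on a merely good value, which is exactly the "local maximum, but it works" trade-off.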
Thanks for taking the time to dig this out, appreciated. I've been reviewing it this morning. The formulas presented in the "measuring success" part, though interesting, seem arbitrary so far. For example, the question of whether an agent should research for efficiency or pick the low-hanging fruit for short-term benefit is answered through a simple sum formula. Another example is where the author(s) state that universal intelligence should favor simpler choices and interact with the environment so as to cause less complexity. Well, that's obvious! Just use a binary inverse logarithmic distributive operator. With the comical response-to-criticisms part (starting from 5.2), I feel like I'm in a Douglas Adams movie.
It also just so happened to quintuple his returns. Not exactly an apples-to-apples comparison when you have a 4% (realistic) vs. 10% (unrealistic) return built into the model.
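To put rough numbers on it (the 30-year horizon is my own assumption, purely for illustration): $1 compounding at 4% for 30 years grows to about 1.04^30 ≈ 3.2, while at 10% it grows to about 1.10^30 ≈ 17.4, roughly a 5x difference in ending balance from that one assumption alone.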
Since the paper is about the new method, one assumes the examples show the new method's results. The first impression is that the new method isn't very effective.
> this is an improvement, but not an exponential one.
I wonder how you define exponential here. If the old version had a 20% probability of losing against Lee Sedol, and the new one has 5%, then one might call that exponential. Something like losing prob = 2^(2012 - current year).
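Plugging years into that toy formula just to show the shape (the base and start year are made up, of course): 2^(2012-2014) = 1/4 = 25%, 2^(2012-2016) = 1/16 ≈ 6%, 2^(2012-2018) = 1/64 ≈ 1.6%. The losing probability halves every year, which is the sense of "exponential" I have in mind.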
Yes, I saw the figure and that's why I commented that the day 2 conversions for the control are basically giving all of the information in the assumed model.
To me it just looks like a whole new batch of assumptions. They might be valid, or they might be fiction.
Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.