Hacker News

I haven't read it closely, but it looks to me as if the calculation estimates the expected number of counterexamples rather than the "measure" of them (however you've chosen to define that).
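The distinction matters. As a toy illustration only (independent trials with a made-up rate; nothing here is from the calculation being discussed), the expected number of counterexamples and the probability of seeing at least one can be quite different quantities:

```python
# Toy model: n independent trials, each a counterexample with probability p.
# These numbers are purely illustrative.
n, p = 1000, 0.01

expected_count = n * p                 # expected number of counterexamples
prob_at_least_one = 1 - (1 - p) ** n   # chance of at least one counterexample
```

Here the expected count is 10, while the probability of at least one counterexample is nearly 1 but bounded by it; in general the two answer different questions, which is the point of the distinction above.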



They said it "should" go down, but that another comment saying the worst case is the same is "also correct".

I do not see any "complete nonsense" here. I suppose they should have used a different word from "tolerance" for the expected value, but that's pretty nitpicky!


I'm genuinely perplexed.

The headline seems at odds with the rest. They seem to suggest that the preprint paper is entirely accurate and that our previous calculation method was incorrect.

Am I interpreting this correctly? Is this just a bad and misleading headline?


A footnote in the article notes that the original "faulty" formula and the new "correct" one are asymptotically equivalent.

> A nitpick, perhaps, but isn't that three orders of magnitude?

Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)


Exactly.

The article is an exercise in optimizing for the wrong metric.


All of this just means that the author's estimate is an upper-bound estimate.

I'm not equipped to evaluate it, but a quick Google search suggests this result goes against the majority of other papers. That's not to say it's wrong, just that reading most other papers on the subject would suggest something else.

> "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."

Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up at a local maximum (instead of the best/correct answer), but it works.

We cannot solve something using a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.

I'm not saying it's the same for LLMs, but "trying a bunch of different values and seeing which one works best" is something we do a lot.
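A minimal sketch of that "try a bunch of values" approach, with a made-up objective function (nothing here comes from the papers under discussion) that has several local maxima, so random search can land near the best one or settle for a worse one:

```python
import math
import random

def objective(x):
    # Hypothetical objective with several local maxima on [0, 2].
    return math.sin(5 * x) + 0.5 * x

def random_search(trials=10_000, lo=0.0, hi=2.0, seed=0):
    # "Try a bunch of different values and see which one works best."
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

x, val = random_search()
```

With enough samples this usually gets close to the global maximum, but with fewer trials (or a bumpier objective) it can stop at a local one, which is the trade-off described above.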


The model we are comparing against makes 10X as many errors.

I hadn't imagined someone would argue that's not a meaningful difference.

Though the difference is statistically significant too.
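One standard way to check significance for a gap like that is a pooled two-proportion z-test. The counts below are purely illustrative (1% vs. 10% error rate on 1,000 trials each; they are not the thread's actual numbers):

```python
import math

def two_proportion_ztest(err1, n1, err2, n2):
    # Pooled two-proportion z-test for a difference in error rates.
    p1, p2 = err1 / n1, err2 / n2
    p = (err1 + err2) / (n1 + n2)                    # pooled error rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value
    return z, p_value

# Illustrative only: 10 errors vs. 100 errors out of 1,000 each.
z, p_value = two_proportion_ztest(10, 1000, 100, 1000)
```

At sample sizes like these, a 10x error gap yields a z-score far beyond any conventional significance threshold.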


Thanks for taking the time to dig this out; appreciated. I've been reviewing it this morning. The formulas presented in the "measuring success" part, though interesting, seem arbitrary so far. For example, the question of whether an agent should research for efficiency or pick the low-hanging fruit for short-term benefit is answered through a simple sum formula. Another example is when the author(s) state that universal intelligence should favor simpler choices and interact with the environment so as to cause less complexity. Well, that's obvious! Just use a binary inverse logarithmic distributive operator.

With the comical response-to-criticisms part (starting from 5.2), I feel like I'm in a Douglas Adams movie.

> 10% was an example to simplify the model.

It also just so happened to quintuple his returns. It's hardly an apples-to-apples comparison when you have a 4% (realistic) vs. 10% (unrealistic) return built into the model.
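The quintupling falls straight out of compounding. Assuming a 30-year horizon (my assumption, not stated in the thread), $1 grows like this:

```python
# Growth of $1 at 4% vs. 10% annual return, compounded over 30 years.
# The 30-year horizon is an illustrative assumption.
years = 30
realistic = 1.04 ** years    # roughly 3.2x
optimistic = 1.10 ** years   # roughly 17.4x
ratio = optimistic / realistic
```

Over that horizon the 10% model ends up with roughly five times the money of the 4% model, so the choice of rate is doing almost all of the work.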


Specifically, the first "underestimate" is wrong; the second is correct. :-)

Since the paper is about the new method, one assumes the examples show the new method's results. The first impression is that the new method isn't very effective.

> this is an improvement, but not an exponential one.

I wonder how you define exponential here. If the old version had a 20% probability of losing against Lee Sedol and the new one has 5%, then one might call it exponential. Something like losing prob = 2^(2012 - current year).
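Taking that hypothetical formula literally (it is purely illustrative, not a measured trend), "exponential improvement" means the losing probability halves every year:

```python
def losing_prob(year):
    # Hypothetical model: the losing probability halves each year after 2012.
    # Purely illustrative; not fitted to any real match data.
    return 2.0 ** (2012 - year)
```

Under this toy model, a drop from 20% to 5% is two halvings, i.e. two years of "exponential" progress.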


Yes, I saw the figure and that's why I commented that the day 2 conversions for the control are basically giving all of the information in the assumed model.

To me it just looks like a whole new batch of assumptions, which might be valid or not.


> arbitrarily correcting old measures.

Why do you say the correction is arbitrary? Are there papers arguing for corrections in the other direction?


The biggest problem is that the article calls the results "counter-intuitive". That certainly isn't my experience.

Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.

> we'd expect little difference

I think that also likely holds for the quants, the difference could very well be within the error bars.

Anyway, it's been posted to r/locallama so I'm sure someone will try it within the hour and report back soon :P

