I haven't read it closely, but it looks to me as if the calculation estimates the expected number of counterexamples rather than the "measure" of them (however you've chosen to define that).
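To illustrate the distinction with made-up numbers (I don't know how the calculation actually defines things): if counterexamples occur with density p = 2^-40 over a space of n = 2^64 inputs, their measure is a negligible 2^-40, yet the expected number of counterexamples is n*p = 2^24, roughly 16 million. A count and a measure can point in very different directions.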
They said it "should" go down, but that another comment saying the worst case is the same is "also correct".
I do not see any "complete nonsense" here. I suppose they should have used a different word from "tolerance" for the expected value, but that's pretty nitpicky!
The headline seems to differ from the rest of the article. It seems to suggest that the preprint paper is entirely accurate and that our previous calculation method was incorrect.
Am I interpreting this correctly? Is this just a bad and misleading headline?
> A nitpick, perhaps, but isn't that three orders of magnitude?
Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)
I'm not equipped to evaluate this, but a quick Google search suggests this result goes against the majority of other papers. That's not to say it's wrong, just that reading most other papers on the subject would suggest something else.
> "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up at a local maximum (instead of the best/correct answer), but it works.
We cannot solve the problem with a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.
I'm not saying it's the same for LLMs, but "trying a bunch of different values and seeing which one works best" is something we do a lot.
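As a concrete toy of what I mean (my own sketch, nothing to do with any particular paper; the objective function and sample count are invented for illustration), random search looks roughly like this:

    import random

    def objective(x):
        # Stand-in for an expensive black-box score (say, validation accuracy);
        # we assume there is no closed-form way to find its maximum.
        return -(x - 0.3) ** 2 + 0.01 * random.random()

    best_x, best_score = None, float("-inf")
    for _ in range(100_000):          # "a billion (or whatever)" samples, scaled down
        x = random.uniform(0.0, 1.0)  # sample a candidate value at random
        score = objective(x)
        if score > best_score:        # keep the best value seen so far
            best_x, best_score = x, score

    print(best_x, best_score)

With enough samples it lands near the true optimum around 0.3; with noisier or multimodal objectives it may settle on a merely good value, which is exactly the "local maximum, but it works" trade-off.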
Thanks for taking the time to dig this out, appreciated. I've been reviewing it this morning. The formulas presented in the "measuring success" part, though interesting, seem arbitrary so far. For example, the question of whether an agent should research for efficiency or pick the low-hanging fruit for short-term benefit is answered through a simple sum formula. Another example is where the author(s) state that universal intelligence should favor simpler choices and interact with the environment so as to cause less complexity. Well, that's obvious! Just use a binary inverse logarithmic distributive operator. With the comical response-to-criticisms part (starting from 5.2), I feel like I'm in a Douglas Adams movie.
It also just so happened to quintuple his returns. Not exactly an apples-to-apples comparison when you have a 4% (realistic) vs. 10% (unrealistic) return built into the model.
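To put rough numbers on it (the 30-year horizon is my own assumption, purely for illustration): $1 compounding at 4% for 30 years grows to about 1.04^30 ≈ 3.2, while at 10% it grows to about 1.10^30 ≈ 17.4, roughly a 5x difference in ending balance from that one assumption alone.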
Since the paper is about the new method, one assumes the examples show the new method's results. The first impression is that the new method isn't very effective.
> this is an improvement, but not an exponential one.
I wonder how you define exponential here. If the old version had a 20% probability of losing against Lee Sedol, and the new one has 5%, then one might call that exponential. Something like losing prob = 2^(2012 - current year).
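Plugging years into that toy formula just to show the shape (the base and start year are made up, of course): 2^(2012-2014) = 1/4 = 25%, 2^(2012-2016) = 1/16 ≈ 6%, 2^(2012-2018) = 1/64 ≈ 1.6%. The losing probability halves every year, which is the sense of "exponential" I have in mind.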
Yes, I saw the figure and that's why I commented that the day 2 conversions for the control are basically giving all of the information in the assumed model.
To me it just looks like a whole new batch of assumptions. They might be valid, or they might be fiction.
Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.