
The existing models are technically bad candidates for this, as they will inevitably produce tests they can more effectively answer themselves.



If a language model that couldn't do the work passes your test, then it's a bad test.
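
One rough way to make that concrete: run a deliberately weak baseline against the suite and treat a high pass rate as a red flag. A minimal sketch, where `run_test_suite` and `weak_baseline` are hypothetical placeholders rather than any real framework:

  # Sketch: a test suite is suspect if a deliberately weak baseline passes it.
  # `run_test_suite` and `weak_baseline` are hypothetical placeholders.
  def audit_test_suite(run_test_suite, weak_baseline, threshold=0.5):
      """Flag the suite if a model that can't do the work still passes most of it."""
      pass_rate = run_test_suite(weak_baseline)  # fraction of tests passed, 0.0-1.0
      if pass_rate > threshold:
          raise AssertionError(
              f"Weak baseline passed {pass_rate:.0%} of tests; the suite is too easy."
          )
      return pass_rate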

Maybe we should come up with a new testing model then. As it stands, the current model is akin to a game show like Who Wants to Be a Millionaire.

This whole article could be boiled down to: "bad tests are useless".

It is possible to make useless tests.

Actually, those fail the "state-of-the-art" part of the test.

Some approaches aren't very testable: they either produce very bad tests or can't be tested at all.

Since they are experimental, they are generally unlikely to have test cases good enough to make sure they implement these the same way.

Tests are akin to scientific experiments. They test hypotheses and try to falsify claims. They shouldn't be seen as ground truth, but as ways to gain information about what the system claims to be doing. In this sense it makes sense that tests will become obsolete or evolve with the system, because the model and domain upon which the system is based also evolve and change with time.
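
In that spirit, a test reads more like an attempt to falsify a claim than a statement of ground truth. A minimal sketch using the hypothesis property-testing library (the sorting invariant is only an illustrative claim, not something from the article):

  # Sketch: a test as a falsification attempt, not ground truth.
  # Uses the hypothesis library; the sorted() invariant is just an example claim.
  from collections import Counter
  from hypothesis import given, strategies as st

  @given(st.lists(st.integers()))
  def test_claim_survives_random_inputs(xs):
      out = sorted(xs)
      # Claim under test: the output is ordered and is a permutation of the input.
      assert all(a <= b for a, b in zip(out, out[1:]))
      assert Counter(out) == Counter(xs)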

I think the tests should adapt to the design, not the design to the tests.

Tests must not be robust enough, then.

Flawed tests.

In this sort of scenario, the bugs lie in the expectations themselves. Tests that don’t account for that are dead weight.

The article lays out the testability issue, but offers no alternative... and as far as I know there are none.

Also, the article is kinda crap, as it fails to understand the reason they are truly necessary. But I'm upvoting to generate discussion here.


It's not a good test. We'd be better off with benchmarks whose testing capability is updated every year to measure performance, ability, and alignment!

These only make sense if you have proper test data or test data generation.
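
For example, a seeded generator keeps the test data reproducible across runs. A rough sketch; the arithmetic task is just a stand-in for whatever the real test data looks like:

  # Sketch: deterministic test-data generation so evals are reproducible.
  # The arithmetic task is a hypothetical stand-in for the real data.
  import random

  def generate_cases(n=100, seed=42):
      rng = random.Random(seed)  # fixed seed -> same cases on every run
      cases = []
      for _ in range(n):
          a, b = rng.randint(0, 999), rng.randint(0, 999)
          cases.append({"prompt": f"What is {a} + {b}?", "expected": str(a + b)})
      return cases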

This conspiracy always comes up - don't you think they test the output of model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about: it could be prompting, it could be a model revision, or it could just be bad luck.

Changing a testing approach is fine in my opinion, but it should result in retroactive re-testing of the previous models used in any in-article benchmark comparisons.
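
Something like this sketch, where the same (new) benchmark is re-run over every model version that appears in the comparison; `load_model`, `run_benchmark`, and the version names are hypothetical placeholders:

  # Sketch: when the benchmark changes, re-run it over every model version in the
  # comparison instead of mixing old and new scores.
  MODEL_VERSIONS = ["model-v1", "model-v2", "model-v3"]

  def rebuild_comparison(load_model, run_benchmark):
      scores = {}
      for version in MODEL_VERSIONS:
          model = load_model(version)
          scores[version] = run_benchmark(model)  # same benchmark for all versions
      return scores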

What I have trouble with is even gauging the quality of existing tests.

Which is insufficient in terms of technical tests.
