Hacker Read

brucethemoose2 · 2023-09-27 22:05:54

The existing models are technically bad candidates for this, as they will inevitably produce tests they can more effectively answer themselves.

scotty79 | karma 14043 | avg karma 1.62 · | 2023-02-05 11:20:02

If a language model that couldn't do the work passes your test, then it's a bad test.

alanwatts | karma 375 | avg karma 1.03 · | 2016-03-29 08:53:39

Maybe we should come up with a new testing model then. As it stands, the current model is akin to a game show like Who Wants to Be a Millionaire.

philjackson | karma 2322 | avg karma 4.79 · | 2010-05-25 08:09:29+00:00

This whole article could be boiled down to: "bad tests are useless".

R0b0t1 | karma 2066 | avg karma 1.31 · | 2022-08-03 20:53:07

It is possible to make useless tests.

bzbarsky | karma 8095 | avg karma 3.01 · | 2013-10-23 02:47:45+00:00

Actually, those fail the "state-of-the-art" part of the test.

PaulKeeble | karma 4401 | avg karma 5.57 · | 2017-09-11 11:42:14+00:00

Some approaches aren't very testable and produce either very bad tests or can't be tested at all.

justincormack | karma 11120 | avg karma 2.4 · | 2012-04-28 14:00:28+00:00

They are generally unlikely to have good enough test cases to make sure they implement these the same, as they are experimental.

gchamonlive | karma 1702 | avg karma 2.18 · | 2024-07-01 13:12:39

Tests are akin to scientific experiments. They test hypotheses and try to falsify claims. They shouldn't be seen as ground truth, but as ways to gain information about what the system claims to be doing. In this sense, it makes sense that tests will become obsolete or evolve with the system, because the model and domain upon which the system is based also evolve and change with time.

Chris2048 | karma 2919 | avg karma 0.51 · | 2016-03-18 17:26:54+00:00

I think the tests should adapt to the design, not the design to the tests.

earenndil | karma 1974 | avg karma 1.51 · | 2019-03-28 02:47:35+00:00

Tests must not be robust enough, then.

sdurkin | karma 1060 | avg karma 3.75 · | 2008-10-28 00:17:01

Flawed tests.

millstone | karma 6895 | avg karma 3.52 · | 2018-02-14 08:34:19+00:00

In this sort of scenario, the bugs lie in the expectations themselves. Tests that don’t account for that are dead weight.

gcb0 | karma 5864 | avg karma 1.39 · | 2013-07-29 11:56:19

The article lays out the testability issue but offers no alternative... since, as far as I know, there are none.

Also, the article is kinda crap, as it fails to understand the reason they are truly necessary. But upvoting to generate discussion here.

amrb | karma 382 | avg karma 1.4 · | 2023-04-08 18:47:51

It's not a good test. We'd be better off with benchmarks testing capability, updated every year to measure performance, ability and alignment!

funcDropShadow | karma 1182 | avg karma 1.55 · | 2023-04-07 03:41:31

These only make sense if you have proper test data or test data generation.

buildbot | karma 5000 | avg karma 3.69 · | 2023-11-28 14:26:45

This conspiracy always comes up. Don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about: it could be prompting, could be a model revision, could just be bad luck.

buildbuildbuild | karma 1946 | avg karma 6.78 · | 2017-01-10 18:45:04

Changing a testing approach is fine in my opinion, but it should result in retroactive re-testing of the previous models used in any in-article benchmark comparisons.

adrianratnapala | karma 3270 | avg karma 2.22 · | 2018-08-05 22:30:24

What I have trouble with is even gauging the quality of existing tests.

valenterry | karma 2235 | avg karma 1.44 · | 2024-06-04 09:16:11

Which is insufficient in terms of technical tests.