This conspiracy always comes up - don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about, could be prompting, could be a model revision, could just be bad luck.
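
For a sense of scale, a regression gate over those downstream tasks doesn't have to be elaborate. A minimal sketch (the scoring callback, baseline format, and 2% tolerance are all made up for illustration, not anything the vendor has published):

    # Hypothetical regression gate: score a new revision on a fixed set of
    # downstream tasks and report anything that dropped below the stored baseline.
    from typing import Callable, Dict

    def find_regressions(score_task: Callable[[str], float],
                         baseline: Dict[str, float],
                         tolerance: float = 0.02) -> Dict[str, float]:
        """Return {task: new_score} for tasks that fell more than `tolerance` below baseline."""
        return {
            task: new
            for task, old in baseline.items()
            if (new := score_task(task)) < old - tolerance
        }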


So they don't have enough testing before deployments? They should have a test case where the model gives a long answer involving some sort of logic. It would've caught this.
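
It doesn't even have to be fancy; a pytest-style check along these lines (with a hypothetical generate fixture standing in for whatever inference client they actually use) would have exercised exactly that path:

    # Hypothetical smoke test: the model must still produce a long,
    # step-by-step answer to a small logic problem.
    def test_long_logical_answer(generate):
        prompt = (
            "A train leaves a station at 3 pm travelling 60 km/h. A second train "
            "leaves the same station at 4 pm travelling 90 km/h. Explain step by "
            "step when the second train catches up, and put only the final time "
            "on the last line."
        )
        answer = generate(prompt)
        assert len(answer.split()) > 100                 # long-form, not a one-liner
        # 60 km head start closed at 30 km/h => 2 h after 4 pm => 6 pm
        assert "6" in answer.strip().splitlines()[-1]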

Or maybe they skipped tests, thinking this deployment wouldn't affect anything...


Hm, they stated that the gaps in the graphs were due to failures collecting the data for that release. To me that implied they weren't rerunning the tests on every version.

Test suites don't really explain the "why". The worst is having a hundred tests and every one of them verifying incorrect behavior - I've seen it happen.

It's way more likely to just be insufficient testing than anything done on purpose, imo. Those folks have deadlines just like everybody else.

It's one thing to ask for additional information on how the test was conducted, and another to automatically attribute it to a PR play that IBM likes to "milk every last drop of".

That's cynical.


The existing models are technically bad candidates for this, as they will inevitably produce tests they can more effectively answer themselves.

It would have been nice to see whether the changes were actually noticeable or not. Yes, one version may be faster than another, but simply ordering a set of tests like that doesn't show whether it is worth the trouble of doing anything about it.

It's more than that. Optimizely fear that poorly performing tests reflect badly on their other products.

If a language model that couldn't do the work passes your test, then it's a bad test.
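
A contrived illustration of the difference, with summarize standing in for the model call under test (the article text and assertions are invented for the example):

    ARTICLE = (
        "Acme Corp said on Tuesday it will acquire Widget Ltd for $2 billion, "
        "its largest deal to date; the transaction is expected to close next "
        "year, pending regulatory approval."
    )

    def test_summary_weak(summarize):
        assert summarize(ARTICLE)                 # any non-empty string passes

    def test_summary_stronger(summarize):
        out = summarize(ARTICLE).lower()
        assert "acme" in out and "widget" in out  # names both parties
        assert "2 billion" in out                 # keeps the key figure

A model that can't actually do the work sails through the first test and fails the second.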

Flaky tests are an indication of non-determinism either in your test or your system.
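
The textbook example of the test-side variety (illustrative only):

    import random
    import threading
    import time

    def start_worker(results):
        # background worker that appends a result after a short, variable delay
        def work():
            time.sleep(random.uniform(0.01, 0.05))   # simulated variable latency
            results.append("done")
        t = threading.Thread(target=work)
        t.start()
        return t

    def test_flaky():
        results = []
        start_worker(results)
        time.sleep(0.03)                  # guesses at the timing: non-deterministic
        assert results == ["done"]        # passes or fails depending on scheduling

    def test_deterministic():
        results = []
        start_worker(results).join()      # wait for completion instead of guessing
        assert results == ["done"]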

Yeah, my first thought upon reading the article was: if their E2E tests produced non-deterministic results due to asynchrony, how can they have any confidence that their production data ever becomes 'eventually consistent'?
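
The usual fix for E2E tests against an eventually consistent system is to poll against a deadline instead of asserting immediately or sleeping a fixed amount. A small helper along these lines (read_replica in the usage line is a hypothetical client):

    import time

    def wait_until(predicate, timeout=10.0, interval=0.25):
        # poll `predicate` until it returns True or the deadline passes
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if predicate():
                return True
            time.sleep(interval)
        return False

    # usage in a test:
    # assert wait_until(lambda: read_replica.get("order-42") is not None)

If that helper times out regularly, that's the "eventually" in eventual consistency telling you something.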


There’s a slight problem with this approach - you can’t trust your test results in such an environment… if you manage to complete your test suites at all before the next revision drops.

Another explanation: they did test it in other scenarios, but the results went against their hopes, so they 'accidentally' omitted those tests from the 'official' test suite. Very common tactic: you massage your data until you get what you want.

Or something happened that didn't happen in the tests. And if they suspected something might be in an inconsistent state, taking some downtime to make sure it comes back up properly is clearly the better option.

I mean, this test is really annoying, and it's testing a case that I guarantee will almost never happen in production.

I'm also not sure how you would identify this with confidence. Sure, I can see the tests haven't failed on CI. However, maybe the tests fail all the time when they're run during development.

It's almost always the case that applying a new set of tests finds more bugs, so this experiment doesn't actually prove its conclusion.

My guess is bad regression testing based on subjective qualifiers at best, or, at worst, incentives to promote poor results because they drive revenue.

Yup, it's just scoring the revisions. Currently there is no straightforward way to feed false positives back to the tool so it can learn from them.
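
One possible shape for that feedback loop (entirely hypothetical - the tool exposes nothing like this today): persist reviewer-flagged false positives and exclude them from subsequent scoring runs.

    import json

    def load_false_positives(path="false_positives.json"):
        # reviewer-maintained list of finding IDs to ignore
        try:
            with open(path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def adjusted_score(finding_ids, path="false_positives.json"):
        # count only findings that haven't been flagged as false positives
        ignored = load_false_positives(path)
        return sum(1 for f in finding_ids if f not in ignored)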

Yes, that's a terrible mistake by the QA test lead.