This conspiracy always comes up - don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about, could be prompting, could be a model revision, could just be bad luck.
So they don't have enough testing before deployments? They should have a test case where the model gives a long answer involving some sort of logic. It would've caught this.
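Something like this rough sketch would do it (the `generate` hook, the prompt, and the thresholds here are all hypothetical, not from the article):

```python
# Rough sketch of a regression test for long, logic-heavy answers.
# `generate` is a placeholder for whatever client the team actually uses;
# the prompt and thresholds are invented for illustration.

def generate(prompt: str) -> str:
    """Placeholder for the model call under test."""
    raise NotImplementedError

def test_long_multi_step_answer():
    prompt = (
        "Alice is older than Bob. Bob is older than Carol. "
        "Explain step by step who is the youngest, then give the result "
        "on a final line starting with 'Answer:'."
    )
    answer = generate(prompt)

    # A long answer with visible reasoning, not a one-liner.
    assert len(answer.split()) > 50, "answer unexpectedly short"

    # The conclusion should survive the long chain of reasoning.
    final = [line for line in answer.splitlines() if line.startswith("Answer:")]
    assert final and "Carol" in final[-1], "model lost the logic partway through"
```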
Or maybe they skipped tests, thinking this deployment wouldn't affect anything.
Hm, they stated that the gaps in the graphs were due to failures collecting the data for that release. To me that implied they weren't rerunning the tests on every version.
Test suites don't really explain the "why". The worst is having a hundred tests and every one of them verifying incorrect behavior - I've seen it happen.
It's one thing to ask for additional information on how the test was conducted and another to automatically attribute it to a PR play that IBM likes to "milk every last drop of".
It would have been nice to see whether the changes were actually noticeable. Yes, one version may be faster than another, but simply ordering a set of test results like that doesn't show whether it's worth doing anything about.
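For what I mean by "noticeable": rerun each version a handful of times and only treat a difference as real if it clears the run-to-run noise. A rough sketch, where `run_benchmark` is a hypothetical hook for whatever the suite measures:

```python
import statistics

def is_noticeable(run_benchmark, version_a, version_b, runs=10):
    # Repeat the measurement so we can estimate the spread per version.
    a = [run_benchmark(version_a) for _ in range(runs)]
    b = [run_benchmark(version_b) for _ in range(runs)]
    diff = statistics.mean(b) - statistics.mean(a)
    # Pool the per-version spread; a difference smaller than ~2x the
    # combined noise probably isn't worth acting on.
    noise = (statistics.stdev(a) ** 2 + statistics.stdev(b) ** 2) ** 0.5
    return abs(diff) > 2 * noise
```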
Flaky tests are an indication of non-determinism either in your test or your system.
Yeah, my first thought upon reading the article was: If their E2E tests produced non-deterministic results due to asynchrony, how can they have any confidence that their production data ever becomes 'eventually consistent'?
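That's usually where the flakiness comes from. A minimal sketch of the classic pattern and the usual fix, with hypothetical `write_record`/`read_record` calls standing in for whatever the E2E tests actually hit:

```python
import time

def flaky_test(write_record, read_record):
    write_record("key", "value")
    # Flaky: the replica may not have caught up yet, so this assert
    # passes or fails depending on timing.
    assert read_record("key") == "value"

def deterministic_test(write_record, read_record, timeout=10.0, interval=0.2):
    write_record("key", "value")
    deadline = time.monotonic() + timeout
    # Poll until the value converges or we hit a hard deadline; the test
    # now encodes the actual guarantee ("converges within 10s") instead
    # of assuming instant propagation.
    while time.monotonic() < deadline:
        if read_record("key") == "value":
            return
        time.sleep(interval)
    raise AssertionError("record never became consistent within the timeout")
```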
There’s a slight problem with this approach - you can’t trust your test results in such an environment… if you manage to complete your test suites at all before the next revision drops.
Another explanation: they did test it in other scenarios, but the results went against their hopes, so they 'accidentally' omitted those tests from the 'official' test suite. Very common tactic: you massage your data until you get what you want.
Or something happened that didn't happen in the tests. And if they suspected something might be in an inconsistent state, taking some downtime to make sure it comes back up properly is clearly the better option.
I also am not sure how you would identify this with confidence. Sure, I can see the tests haven't failed on CI. However, maybe they fail all the time when run during development.