This conspiracy always comes up - don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about, could be prompting, could be a model revision, could just be bad luck.
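
For a sense of scale, a regression gate over those downstream tasks doesn't have to be elaborate. A minimal sketch (the scoring callback, baseline format, and 2% tolerance are all made up for illustration, not anything the vendor has published):

    # Hypothetical regression gate: score a new revision on a fixed set of
    # downstream tasks and report anything that dropped below the stored baseline.
    from typing import Callable, Dict

    def find_regressions(score_task: Callable[[str], float],
                         baseline: Dict[str, float],
                         tolerance: float = 0.02) -> Dict[str, float]:
        """Return {task: new_score} for tasks that fell more than `tolerance` below baseline."""
        return {
            task: new
            for task, old in baseline.items()
            if (new := score_task(task)) < old - tolerance
        }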


So they don't have enough testing before deployments? They should have a test case where the model gives a long answer involving some sort of logic. It would've caught this.
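
It doesn't even have to be fancy; a pytest-style check along these lines (with a hypothetical generate fixture standing in for whatever inference client they actually use) would have exercised exactly that path:

    # Hypothetical smoke test: the model must still produce a long,
    # step-by-step answer to a small logic problem.
    def test_long_logical_answer(generate):
        prompt = (
            "A train leaves a station at 3 pm travelling 60 km/h. A second train "
            "leaves the same station at 4 pm travelling 90 km/h. Explain step by "
            "step when the second train catches up, and put only the final time "
            "on the last line."
        )
        answer = generate(prompt)
        assert len(answer.split()) > 100                 # long-form, not a one-liner
        # 60 km head start closed at 30 km/h => 2 h after 4 pm => 6 pm
        assert "6" in answer.strip().splitlines()[-1]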

Or maybe they skipped tests, thinking this deployment wouldn't affect anything...


Hm, they stated that the gaps in the graphs were due to failures collecting the data for that release. To me that implied they weren't rerunning the tests on every version.

Test suites don't really explain the "why". The worst is having a hundred tests and every one of them verifying incorrect behavior - I've seen it happen.

It's way more likely to just be insufficient testing than anything done on purpose, imo. Those folks have deadlines just like everybody else.

It's one thing to ask for additional information on how the test was conducted, and another to automatically attribute it to a PR play that IBM likes to "milk every last drop of".

That's cynical.


The existing models are technically bad candidates for this, as they will inevitably produce tests they can more effectively answer themselves.

It would have been nice to see whether the changes were actually noticeable or not. Yes, one version may be faster than another, but simply ordering a set of tests like that doesn't show whether it is worth the trouble of doing anything about it.

It's more than that. Optimizely fear that poorly performing tests reflect badly on their other products.

If a language model that couldn't do the work passes your test, then it's a bad test.
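
A contrived illustration of the difference, with summarize standing in for the model call under test (the article text and assertions are invented for the example):

    ARTICLE = (
        "Acme Corp said on Tuesday it will acquire Widget Ltd for $2 billion, "
        "its largest deal to date; the transaction is expected to close next "
        "year, pending regulatory approval."
    )

    def test_summary_weak(summarize):
        assert summarize(ARTICLE)                 # any non-empty string passes

    def test_summary_stronger(summarize):
        out = summarize(ARTICLE).lower()
        assert "acme" in out and "widget" in out  # names both parties
        assert "2 billion" in out                 # keeps the key figure

A model that can't actually do the work sails through the first test and fails the second.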

Flaky tests are an indication of non-determinism either in your test or your system.
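
The textbook example of the test-side variety (illustrative only):

    import random
    import threading
    import time

    def start_worker(results):
        # background worker that appends a result after a short, variable delay
        def work():
            time.sleep(random.uniform(0.01, 0.05))   # simulated variable latency
            results.append("done")
        t = threading.Thread(target=work)
        t.start()
        return t

    def test_flaky():
        results = []
        start_worker(results)
        time.sleep(0.03)                  # guesses at the timing: non-deterministic
        assert results == ["done"]        # passes or fails depending on scheduling

    def test_deterministic():
        results = []
        start_worker(results).join()      # wait for completion instead of guessing
        assert results == ["done"]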

Yeah, my first thought upon reading the article was: if their E2E tests produced non-deterministic results due to asynchrony, how can they have any confidence that their production data ever becomes 'eventually consistent'?
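
The usual fix for E2E tests against an eventually consistent system is to poll against a deadline instead of asserting immediately or sleeping a fixed amount. A small helper along these lines (read_replica in the usage line is a hypothetical client):

    import time

    def wait_until(predicate, timeout=10.0, interval=0.25):
        # poll `predicate` until it returns True or the deadline passes
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if predicate():
                return True
            time.sleep(interval)
        return False

    # usage in a test:
    # assert wait_until(lambda: read_replica.get("order-42") is not None)

If that helper times out regularly, that's the "eventually" in eventual consistency telling you something.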


There’s a slight problem with this approach - you can’t trust your test results in such an environment… if you manage to complete your test suites at all before the next revision drops.

Another explanation: they did test it in other scenarios, but the results went against their hopes, so they 'accidentally' omitted those tests from the 'official' test suite. Very common tactic: you massage your data until you get what you want.

Or something happened that didn't happen in the tests. And if they suspected something might be in an inconsistent state, taking some downtime to make sure it comes back up properly is clearly the better option.

I mean, this test is really annoying, and it's testing a case that I guarantee will almost never happen in production.

I'm also not sure how you would identify this with confidence. Sure, I can see the tests haven't failed on CI. However, maybe the tests fail all the time when they're run during development.

It's almost always the case that applying a new set of tests finds more bugs, so this experiment doesn't actually prove its conclusion.

My guess is bad regression testing based on subjective qualifiers at best, or, at worst, incentives to promote poor results because they drive revenue.

Yup, it's just scoring the revisions. Currently there is no straightforward way to feed false positives back to the tool so it can learn from them.
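
One possible shape for that feedback loop (entirely hypothetical - the tool exposes nothing like this today): persist reviewer-flagged false positives and exclude them from subsequent scoring runs.

    import json

    def load_false_positives(path="false_positives.json"):
        # reviewer-maintained list of finding IDs to ignore
        try:
            with open(path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def adjusted_score(finding_ids, path="false_positives.json"):
        # count only findings that haven't been flagged as false positives
        ignored = load_false_positives(path)
        return sum(1 for f in finding_ids if f not in ignored)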

Yes, that's a terrible mistake by the QA test lead.