Tests are akin to scientific experiments. They test hypothesis and try to falsify claims. They shouldn't be seen as ground truth, but ways to gain information about what the system claims to be doing. In this sense it makes sense that tests will become obsolete or evolve with the system, because the model and domain upon which the system is based also evolves and changes with time.
This conspiracy always comes up - don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about, could be prompting, could be a model revision, could just be bad luck.
Changing a testing approach in my opinion is fine, but should result in the retroactive re-testing of previous models used in any in-article benchmark comparisons.
reply