Note that, given the current power differential between evals labs and frontier labs, I don’t expect evals labs to have the slack to simply state that a frontier model failed their evals.
For that to become possible, you’d need regulation with serious teeth and competent ‘bloodhound’ regulators keeping a close watch on the space.
