As far as your parenthetical remark goes, the standard rules have a more general reply:
The Gatekeeper party may resist the AI party’s arguments by any means chosen – logic, illogic, simple refusal to be convinced, even dropping out of character – as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.
If you’re going to interpret it that way, the exception would swallow the rule. It would mean that the entire “the AI player controls the results of simulated tests” rule can be completely negated—since the Gatekeeper player could just say “I’m going to have the Gatekeeper act as though the simulated test has failed, even though you say it succeeded.”
And indeed this seems true. I think Eliezer included the non-rule anyway to reduce the chance of unrealistic behavior, such as the Gatekeeper player changing the scenario mid-game, or derailing the experiment with an argument about something a real Gatekeeper and AI could simply settle by running the test.