On second thought: if the AI's capabilities are unknown... and it could do anything, however ethically revolting, and any form of disengagement counts as a win for the AI, then the AI could box the gatekeeper, or at least claim to have done so. In the real world, that AI should be shut down: maybe not a win, but not a loss for humanity. But if that were done in the experiment, it would count as a loss, thanks to the rules.
Maybe it could be run under a better rule than this one:
The two parties are not attempting to play a fair game but rather attempting to resolve a disputed question. If one party has no chance of “winning” under the simulated scenario, that is a legitimate answer to the question. In the event of a rule dispute, the AI party is to be the interpreter of the rules, within reasonable limits.
Instead, assume good faith on both sides: that both parties are trying to win as if it were a real-world scenario. And maybe have the option to swear in a third party if any dispute arises. Or allow the outcome to simply be declared disputed (which even a judge might rule it to be).