If there were a proof of S, then cooperation passes X’s sanity test
Considering how much trouble it was to formulate the setups where this actually rigorously follows, I feel that this deserves a footnote, something like this:
Existence of a proof of S is not enough to conclude that X’s sanity check succeeds, since provability is not decidable and so the condition for failing the check doesn’t imply absence of a proof. One way out is to have a provability oracle perform the sanity check, which makes it impossible for players to be algorithms. Another is to prove somebounds on the length of the shortest proof of S, and write the algorithm in a way that guarantees the sanity check to succeed within these bounds. Yet another way is to never give up, work on the sanity check forever, which makes the agent completely unable to cope with many situations.
Considering how much trouble it was to formulate the setups where this actually rigorously follows, I feel that this deserves a footnote, something like this:
Existence of a proof of S is not enough to conclude that X’s sanity check succeeds, since provability is not decidable and so the condition for failing the check doesn’t imply absence of a proof. One way out is to have a provability oracle perform the sanity check, which makes it impossible for players to be algorithms. Another is to prove some bounds on the length of the shortest proof of S, and write the algorithm in a way that guarantees the sanity check to succeed within these bounds. Yet another way is to never give up, work on the sanity check forever, which makes the agent completely unable to cope with many situations.
Good footnote— I’ll add it.