soth02 comments on Half-baked AI Safety ideas thread

soth02 18 Aug 2022 0:56 UTC
1 point
There might be a way to elicit how aligned/unaligned the putative AGI is.
1. Enter into a Prisoner’s Dilemma type scenario with the putative AGI.
2. Start off in the non-Nash equilibrium of cooperate/cooperate.
3. The number of rounds is specified at random and isn’t known to participants. (possible variant is declare false last rounds, and then continue playing for x rounds).
4. Observe when/if the putative AGI defects in the ‘last’ round.