There might be a way to elicit how aligned/unaligned the putative AGI is.
Enter into a Prisoner’s Dilemma type scenario with the putative AGI.
Start off in the non-Nash equilibrium of cooperate/cooperate.
The number of rounds is specified at random and isn’t known to participants. (possible variant is declare false last rounds, and then continue playing for x rounds).
Observe when/if the putative AGI defects in the ‘last’ round.
There might be a way to elicit how aligned/unaligned the putative AGI is.
Enter into a Prisoner’s Dilemma type scenario with the putative AGI.
Start off in the non-Nash equilibrium of cooperate/cooperate.
The number of rounds is specified at random and isn’t known to participants. (possible variant is declare false last rounds, and then continue playing for x rounds).
Observe when/if the putative AGI defects in the ‘last’ round.