For example, if the AI were contained in a simulation, and within that simulation it were also contained in a weak AI box, then detecting and understanding the nature of the outer simulation might be much more difficult than escaping the simulated AI box, and such an escape would signal a treacherous turn.
That approach sounds problematic. Some of the obvious escape methods would target the minds of the researchers, either through real-time interaction or by embedding messages in the AI's code or output. You could cut off the latter channel with strong social rules against looking at anything beyond a one-bit verdict on whether the AI attempted escape, but providing a convincing in-simulation environment would likely be difficult if you can't connect it to the AI's actual creators.
Or it could be kept in one’s back pocket, and pulled out whenever an AI researcher claims in public that they’ve solved friendliness.
That strategy could be very dangerous. It’d work on the less ambitious/arrogant sort of researcher; the more confident sort might well follow up with “I’ll just go and implement this, and get all the credit for saving the world single-handedly” instead of saying anything in public, never giving you the chance to pull out your challenge.