Many approaches become available if you can use counterfactuals or “false” information in the AI. For example, an AI that doesn’t “believe” a particular trigger is armed, and so gets caught by that trigger when it defects without first neutralising it.
There’s a lot of work coming that uses this idea, implicitly or explicitly. See http://lesswrong.com/lw/lt6/newish_ai_control_ideas/
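The trigger idea above can be sketched as a toy simulation (all names here are hypothetical, invented purely for illustration): the agent only neutralises the trigger if it believes the trigger is armed, so an agent holding the false belief is caught the moment it defects.

```python
# Toy sketch of a honeypot trigger the agent falsely believes is disarmed.
# All class/function names are hypothetical illustrations, not a real API.

class Agent:
    def __init__(self, believes_trigger_armed: bool):
        self.believes_trigger_armed = believes_trigger_armed

    def act(self) -> str:
        # The agent only bothers neutralising the trigger if it
        # believes the trigger is actually armed.
        if self.believes_trigger_armed:
            return "neutralise_then_defect"
        return "defect"

def run_environment(agent: Agent, trigger_actually_armed: bool) -> str:
    action = agent.act()
    # An unguarded defection trips the (still-armed) trigger.
    if action == "defect" and trigger_actually_armed:
        return "caught"
    return "undetected"

# False belief: the trigger is armed, but the agent thinks it isn't.
deceived = Agent(believes_trigger_armed=False)
print(run_environment(deceived, trigger_actually_armed=True))  # caught
```

An agent with accurate beliefs would neutralise the trigger first and go undetected, which is exactly why the “false” information matters here.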