Could you maybe add some more explanation of how the stated problem is relevant for AI control? It’s not obvious to me from the outset why I care about duping an AI.
Many approaches become available if you can feed the AI counterfactuals or “false” information. For example, an AI that doesn’t “believe” a particular trigger is armed can then be caught by that trigger when it defects without first neutralising it.
There’s a lot of upcoming work that uses this, implicitly or explicitly. See http://lesswrong.com/lw/lt6/newish_ai_control_ideas/