Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it will probably just do your simple task. The idea is that you build a fairly secure box and give the AI a task it can fairly easily achieve inside that box (with you having no intention of pressing the stop button so long as the AI seems to be acting normally). We want to make “just do your task” the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut down.
This approach could fail if the easiest way to cause trouble has nothing to do with the stop button (say the AI can send its code elsewhere over the internet).
A generalization: you have some real-world variable X. You run a bunch of AIs that are identical except for different dogmatic beliefs about X. If they ever disagree on which action to take, then the AIs are using the value of X in their optimization. Shutdown.
This acts as a kind of fuse. Have one AI confident the door password is 1234, and a copy confident it is 5678. If these AIs ever disagree on what to do next, then at least one of them was trying to open the door. Shutdown. A slightly counter-intuitive point here is that you want to make sure that, if the AI did know the password, the easiest way to leave would be through the door; otherwise both AIs will agree to climb out a window.
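A minimal sketch of the disagreement fuse, assuming a step-based agent loop; the `choose_action` function, the belief values, and the shutdown mechanism are all hypothetical names for illustration, not a specific implementation from the discussion above.

```python
# Sketch of the "disagreement fuse": run identical copies of the agent that
# differ only in a dogmatic belief about X, and shut everything down the
# moment their proposed actions diverge. All names here (choose_action,
# the belief values) are hypothetical.

from typing import Any, Callable, Sequence


def disagreement_fuse(
    choose_action: Callable[[Any, Any], Any],  # (observation, assumed value of X) -> action
    beliefs: Sequence[Any],                    # dogmatic values of X, one per copy
    observation: Any,
) -> Any:
    """Return the agreed-upon action, or trigger shutdown if the copies disagree."""
    actions = [choose_action(observation, x) for x in beliefs]
    if any(a != actions[0] for a in actions[1:]):
        # At least one copy's plan depends on the value of X, so the
        # optimization is using X. Pull the fuse.
        raise SystemExit("Copies disagreed: optimization is using X. Shutting down.")
    return actions[0]


# Example usage with the door-password scenario (values illustrative):
# action = disagreement_fuse(choose_action, beliefs=["1234", "5678"], observation=obs)
```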