So we have a switch with two positions, “R” and “L.”
When the switch is “R,” the agent is supposed to want to go to the right end of the hallway, and vice versa for “L” and left. It’s not that you want this agent to be uncertain about the “correct” value of the switch and so it’s learning more about the world as you send it signals—you just want the agent to want to go to the left when the switch is “L,” and to the right when the switch is “R.”
If you start with the agent going to the right along this hallway, and you change the switch to “L,” and then a minute later change your mind and switch back to “R,” it will have turned around and passed through the same spot in the hallway multiple times.
The point is that if you try to define the agent's utility as a function of the world state, you run into an issue with cycles: if you're always moving "downhill," you can't get back to where you were before.
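To make the cycle problem concrete, here is a minimal sketch. The discrete 0–10 hallway, the particular utility function, and the greedy step rule are all my own illustrative assumptions, not part of the original setup: an agent that only ever steps to a neighbouring position with strictly higher utility sees strictly increasing utility along its path, so it can never revisit a position, which rules out the turn-around-and-come-back behaviour just described.

```python
# Minimal sketch (illustrative assumptions): a hallway of positions 0..10 and a
# fixed utility function over positions. An agent that only ever steps to a
# neighbour with strictly higher utility sees strictly increasing utilities
# along its path, so it can never pass through the same position twice.

def greedy_trajectory(utility, start, length=10):
    """Follow strictly increasing utility until no neighbour improves on it."""
    pos, path = start, [start]
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p <= length]
        better = [p for p in neighbours if utility(p) > utility(pos)]
        if not better:
            return path
        pos = max(better, key=utility)
        path.append(pos)

# Any fixed utility over positions, e.g. "further right is better":
U = lambda p: p
path = greedy_trajectory(U, start=3)
assert len(path) == len(set(path))  # no position is ever revisited

# The corrigible behaviour we want (go right, then left after the switch flips,
# then right again) revisits positions, so no single fixed utility over hallway
# positions can produce it.
```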
Yeah, thanks for remembering me! You can also posit that the agent is omniscient from the start, so it did not change its policy due to learning. This argument proves that an agent cannot be corrigible and a maximizer of the same expected utility function over world states across multiple shutdowns. But it still leaves open the possibility of an agent that is corrigible while rewriting its utility function after every correction.
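Here is a minimal sketch of that second possibility; the two utility functions and the way the switch is handled are entirely my own illustrative assumptions. Each correction overwrites the agent's current utility function rather than updating a belief about which direction is "really" correct, and the resulting back-and-forth trajectory revisits positions, which no single fixed utility over positions could produce.

```python
# Sketch (illustrative assumptions): the switch does not inform a fixed utility
# function; flipping it *replaces* the agent's utility function outright.

U_RIGHT = lambda pos: pos    # "further right is better"
U_LEFT = lambda pos: -pos    # "further left is better"

class RewritableAgent:
    def __init__(self, pos):
        self.pos = pos
        self.utility = U_RIGHT  # the current utility function, not a belief state

    def correct(self, switch):
        # Each correction rewrites the utility function outright.
        self.utility = U_RIGHT if switch == "R" else U_LEFT

    def step(self):
        # Greedily move one cell toward higher utility under the current function.
        candidates = [self.pos - 1, self.pos, self.pos + 1]
        self.pos = max(candidates, key=self.utility)
        return self.pos

agent = RewritableAgent(pos=3)
trace = [agent.step() for _ in range(2)]        # heads right: 4, 5
agent.correct("L")
trace += [agent.step() for _ in range(3)]       # turns around: 4, 3, 2
agent.correct("R")
trace += [agent.step() for _ in range(2)]       # turns again: 3, 4
print(trace)  # positions are revisited, which no single fixed utility allows
```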