Isn’t that essentially a false belief about one’s own preferences?
I mean, the AI’s “true” VNM utility function, to the extent that it has one, is going to be different from the utility function the AI reflectively thinks it has. In principle the AI could discover the difference, and this could cause it to alter its behavior.
Or maybe not; I don’t have a strong intuition about this at the moment. But if I recall correctly, in the previous work on corrigibility (I haven’t read the latest version you linked yet), Soares was thinking of using causal decision nodes to implement utility indifference for the shutdown problem. This effectively introduces false beliefs into the agent, as the agent is mistaken about what causes the button to be pressed.
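For concreteness, here is one toy shape that kind of construction could take (this is my own illustrative sketch, not the definition from the corrigibility paper; U_N and U_S stand for the “normal” and “shutdown” utility functions, and M for the agent’s causal model):

$$
U(a) \;=\; \mathbb{E}_{w \sim M}\big[\,U_S(a, w)\,\big] \;+\; \mathbb{E}_{w \sim M_{\mathrm{do}(\mathrm{signal}\,:=\,\mathrm{off})}}\big[\,U_N(a, w)\,\big]
$$

The second expectation is taken in the intervened model where the press→signal link is cut and the signal is held off no matter what the humans do, which is where the “mistaken about the causes” description gets its pull.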
My preferred interpretation of that particular method is not “the agent has false beliefs,” but instead “the agent cares both about the factual and the counterfactual worlds, and is trying to maximize utility in both at once.” That is, if you were to cry
“But if the humans press the button, the press signal will occur! So why are you acting such that you still get utility in the counterfactual world where humans press the button and the signal fails to occur?”
It will look at you funny, and say “Because I care about that counterfactual world. See? It says so right here in my utility function.” It knows the world is counterfactual; it just cares about “what would have happened” anyway. (Causal decision nodes are used to formalize “what would have happened” in the agent’s preferences; this says nothing about whether the agent uses causal reasoning when making decisions.)
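As a concrete (and heavily simplified) illustration of the “cares about both worlds” reading, here is a toy sketch; the world model, the actions, and the utilities u_normal / u_shutdown are all invented for the example and are not the formalism from the corrigibility paper.

```python
# Toy illustration only: an agent whose score for an action adds up value in
# the factual world (where pressing the button really does cause the shutdown
# signal) and in the counterfactual world where the press fails to cause it.

def outcome(action, button_pressed, press_causes_signal):
    """Tiny world model: does the agent end up shut down or still running?"""
    signal = button_pressed and press_causes_signal
    if action == "disable_button":
        signal = False  # disabling the button suppresses the signal
    return "shut_down" if signal else "running"

def u_shutdown(result):
    # Value placed on having complied with a shutdown signal.
    return 10 if result == "shut_down" else 0

def u_normal(result):
    # Value placed on continuing the normal task.
    return 10 if result == "running" else 0

def combined_utility(action, button_pressed):
    # Factual world: the press does cause the signal.
    factual = outcome(action, button_pressed, press_causes_signal=True)
    # Counterfactual world the agent *also* cares about: same press, but the
    # signal fails to occur ("what would have happened" without the signal).
    counterfactual = outcome(action, button_pressed, press_causes_signal=False)
    return u_shutdown(factual) + u_normal(counterfactual)

if __name__ == "__main__":
    for action in ("comply", "disable_button"):
        print(action, combined_utility(action, button_pressed=True))
    # comply         -> 20 (shut down in the factual world, running in the counterfactual one)
    # disable_button -> 10 (it keeps running, but forfeits the value of the factual shutdown)
```

The agent here is not confused about which world is real; the counterfactual term is simply part of what it values, which is exactly the distinction the parenthetical above is drawing.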
This greatly clarified the distinction for me. Well done.
Makes sense.
No. It’s an adjusted preference that functions in practice just like a false belief.