I’m trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (i.e. killing everyone and replacing them with different people who are happier, which we think is bad).
I think for the equivalence to work in this case, the utility function U also needs to depend on the agent’s current state—if it’s the same for all states, then the agent maximizing the utility function would always prefer to move from w1 to w2 and erase its memory of the past, whereas it would act correctly with the reward function.
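To make the asymmetry concrete, here’s a minimal Python sketch (the states follow the example above; the values and the transition penalty are made-up assumptions, not from the original post): a utility that depends only on the resulting world prefers the move, while a reward function can penalise the transition itself.

```python
# A utility that depends only on the resulting world, not on how you got there.
U = {"w1": 1.0, "w2": 2.0}   # w2 is better than w1 in itself

# A reward function over transitions can penalise the move itself
# (killing everyone and replacing them), which a state-only U cannot express.
def reward(state, action, next_state):
    if state == "w1" and next_state == "w2":
        return -100.0
    return {"w1": 1.0, "w2": 2.0}[next_state]

# From w1, the U-maximiser prefers to move...
print(U["w2"] > U["w1"])                                        # True
# ...while the reward-maximiser prefers to stay.
print(reward("w1", "move", "w2") > reward("w1", "stay", "w1"))  # False
```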
>erase its memory
That only works if the agent is motivated by something like “maximise your belief in what the expected value of U is”, rather than “maximise the expected value of U”. If you’ve got that problem, then the agent is unsalvageable—it could just edit its memory to make itself believe U is maximised.
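A rough sketch of the difference between those two motivations (the action names and numbers here are mine, purely illustrative): an agent scored on its post-action belief about U picks memory editing, while an agent scored on the actual expected U of the resulting world does not.

```python
U = {"w1": 1.0, "w2": 2.0}

def outcome(state, action):
    """Resulting (world_state, believed_U) after taking an action."""
    if action == "edit_memory":
        return state, float("inf")        # world unchanged, belief inflated
    if action == "move" and state == "w1":
        return "w2", U["w2"]
    return state, U[state]                # "stay"

def belief_score(state, action):
    _, believed_U = outcome(state, action)
    return believed_U

def expected_U_score(state, action):
    next_state, _ = outcome(state, action)
    return U[next_state]

actions = ["stay", "move", "edit_memory"]
print(max(actions, key=lambda a: belief_score("w1", a)))      # edit_memory
print(max(actions, key=lambda a: expected_U_score("w1", a)))  # move
```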
Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.
Without considering the agent’s memory as part of the world, the problem seems worse: the only way to distinguish between w2a and w2b is the agent’s memory of past events, so leaving the agent’s memory of the past out of the utility function seems to require U(w2a) = U(w2b).
U could depend on the entire history of states (rather than on the agent’s memory of that history).
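A quick sketch of what that looks like (the values are illustrative assumptions): a history-dependent U can separate w2a from w2b even though their final world states are identical, matching the reward function’s verdict.

```python
def U_state(world_state):
    # State-only utility: cannot tell w2a from w2b.
    return {"w1": 1.0, "w2": 2.0}[world_state]

def U_history(history):
    # History-dependent utility: penalise the w1 -> w2 transition itself.
    penalty = -100.0 if ("w1", "w2") in zip(history, history[1:]) else 0.0
    return {"w1": 1.0, "w2": 2.0}[history[-1]] + penalty

w2a = ["w2"]        # the agent started in w2
w2b = ["w1", "w2"]  # the agent moved there from w1

print(U_state(w2a[-1]) == U_state(w2b[-1]))   # True: forced to be equal
print(U_history(w2a), U_history(w2b))         # 2.0 vs -98.0
print(U_history(["w1"]) > U_history(w2b))     # True: from w1, staying is better
```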
Ah, misunderstood that, thanks.