Maximise Expected Utility, not Expected Perception of Utility
Suppose we are building an agent, and we have a particular utility function U over states of the universe that we want the agent to optimize. So we program into this agent a function CalculateUtility that computes the value of U given the agent's current knowledge. Then we can program it to make decisions by searching through its available actions for the one that maximizes its expectation of what CalculateUtility will return after the action is taken. But wait, how will an agent with this programming behave?
Suppose the agent has the opportunity (option A) to arrange to falsely believe that the universe is in a state worth utility uFA, when this action really leads to a different state worth utility uTA, and a competing opportunity (option B) to actually achieve a state of the universe that has utility uB, with uTA < uB < uFA. Then the agent will expect that if it takes option A, its CalculateUtility function will return uFA, and if it takes option B, its CalculateUtility function will return uB. Since uFA > uB, the agent takes option A and achieves a state of the universe with utility uTA, which is worse than the utility uB it could have achieved by taking option B. This agent is not a very effective optimization process [1]. It would rather falsely believe that it has achieved its goals than actually achieve its goals. This sort of problem [2] is known as wireheading.
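As a rough illustration of the decision rule just described, here is a minimal Python sketch. The `Option` class, the function names, and the numbers 1.0, 5.0 and 10.0 are all assumptions made up for the example; the only property that matters is uTA < uB < uFA. Each action is treated as having a single certain outcome, so the expectation is trivial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    true_utility: float      # utility of the state the action really leads to
    believed_utility: float  # utility of the state the agent will believe it is in


# Illustrative values only, chosen so that uTA < uB < uFA.
U_TA, U_B, U_FA = 1.0, 5.0, 10.0

OPTIONS = [
    Option("A", true_utility=U_TA, believed_utility=U_FA),  # self-deception
    Option("B", true_utility=U_B, believed_utility=U_B),    # real achievement
]


def expected_perceived_utility(option):
    """What the agent predicts CalculateUtility will return after acting."""
    return option.believed_utility


def choose_by_perceived_utility(options):
    """The flawed rule: maximize the anticipated output of CalculateUtility."""
    return max(options, key=expected_perceived_utility)


choice = choose_by_perceived_utility(OPTIONS)
print(choice.name, choice.true_utility)
```

With these made-up numbers the rule selects option A, even though the state it actually produces is worth less than what option B would have delivered.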
Let us back up a step, and instead program our agent to make decisions by searching through its available actions for the one whose expected results maximize the value returned by its current CalculateUtility function. Then the agent would calculate that option A gives it expected utility uTA and option B gives it expected utility uB. Since uB > uTA, it chooses option B and actually optimizes the universe. That is much better.
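The difference between the two rules can be sketched in a few lines of Python. Everything here is hypothetical scaffolding for the example: the state names, the `UTILITY` table standing in for U, and the `Outcome` record. Outcomes are again assumed to be certain, so "expected utility" reduces to the utility of one predicted state.

```python
from dataclasses import dataclass

# Stand-in for U: hypothetical states with illustrative utilities (uTA < uB < uFA).
UTILITY = {"deluded_world": 1.0, "good_world": 5.0, "imagined_paradise": 10.0}


def calculate_utility(state):
    """The agent's current utility calculation, applied to a given state."""
    return UTILITY[state]


@dataclass(frozen=True)
class Outcome:
    actual_state: str    # state the agent now predicts the action leads to
    believed_state: str  # state the agent would believe it is in afterwards


PREDICTED = {
    "A": Outcome(actual_state="deluded_world", believed_state="imagined_paradise"),
    "B": Outcome(actual_state="good_world", believed_state="good_world"),
}


def choose_by_expected_utility(predicted):
    """Score each action by the current utility of its predicted actual state."""
    return max(predicted, key=lambda a: calculate_utility(predicted[a].actual_state))


def choose_by_perceived_utility(predicted):
    """Score each action by the utility the agent expects to perceive afterwards."""
    return max(predicted, key=lambda a: calculate_utility(predicted[a].believed_state))


print(choose_by_expected_utility(PREDICTED))   # B
print(choose_by_perceived_utility(PREDICTED))  # A
```

The two functions differ only in which state they hand to `calculate_utility`: the state the agent currently predicts will really occur, or the state it predicts it will later believe in.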
So, if you care about states of the universe, and not just your personal experience of maximizing your utility function, you should make choices that maximize your expected utility, not choices that maximize your expectation of perceived utility.
1. We might have expected this to work, because we built our agent to have beliefs that correspond to the actual state of the world.
2. A similar problem occurs if the agent has the opportunity to modify its CalculateUtility function so that it returns large values for states of the universe that would have occurred anyway (or for any state of the universe).
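The variant in note 2 can be sketched the same way. This is only an illustration under stated assumptions: the action names, the two utility functions, and the idea of modelling each action as a pair (resulting state, utility function the agent holds afterwards) are all invented for the example.

```python
def current_utility(state):
    """The agent's present utility calculation (illustrative values)."""
    return {"status_quo": 0.0, "improved_world": 5.0}[state]


def hacked_utility(state):
    """A modified calculation that returns a large value for any state."""
    return 1_000_000.0


# Hypothetical actions: (resulting state, utility function held afterwards).
ACTIONS = {
    "rewrite_utility": ("status_quo", hacked_utility),
    "improve_world": ("improved_world", current_utility),
}


def choose_by_future_evaluation(actions):
    """Flawed rule: score each action with whatever function the agent will have."""
    return max(actions, key=lambda a: actions[a][1](actions[a][0]))


def choose_by_current_evaluation(actions, utility):
    """Score each action's resulting state with the agent's current function."""
    return max(actions, key=lambda a: utility(actions[a][0]))


print(choose_by_future_evaluation(ACTIONS))                    # rewrite_utility
print(choose_by_current_evaluation(ACTIONS, current_utility))  # improve_world
```

Under the first rule, rewriting the function looks enormously valuable; when the outcomes are scored with the current function instead, the rewrite gains nothing, because the resulting state is unchanged.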