If the goal is just for PAL to get the coffee to Dave as fast as possible, then PAL is operating correctly and there is no problem.
If Dave sees PAL take route B and then does nothing, then in reality PAL chooses route A. This is the intended behavior. If Dave sees PAL take route B and tries to help PAL, then PAL chooses route B. As a result, both in simulation and in reality, Dave helps PAL. PAL gets the coffee faster than if it had chosen route A. PAL successfully maximized its utility function. Again, intended behavior. The mechanism that leads Dave to help PAL in the second scenario is irrelevant.
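A minimal sketch of the decision rule being described, with toy numbers and hypothetical names (route B is assumed to be faster, but only passable if Dave carries the robot):

```python
def choose_route(dave_helps_on_B: bool) -> str:
    """PAL's decision rule: predict (via simulation) whether Dave would help
    on route B, then pick whichever route delivers the coffee fastest."""
    delivery_time = {
        "A": 10.0,                                      # slow but passable alone
        "B": 4.0 if dave_helps_on_B else float("inf"),  # fast only if Dave helps
    }
    return min(delivery_time, key=delivery_time.get)

print(choose_route(dave_helps_on_B=False))  # "A": simulated Dave ignores PAL
print(choose_route(dave_helps_on_B=True))   # "B": simulated Dave helps, so PAL exploits that
```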
I think this looks bad for two reasons. First, we might assign lower utility to worlds where Dave goes out of his way to help PAL; if you include that term in PAL's utility function, the problem disappears. Second, Dave's reasoning is flawed. In reality, he winds up helping PAL because he thinks he's in a simulation, even though he's not. We might assign lower utility to worlds where Dave is wrong.
Right, this is a sort of incentive for deception. The deception is working fine at getting the objective; we want to ultimately solve this problem by changing the objective function so that it properly captures our dislike of deception (or of having to get up and carry a robot, or whatever), not by changing the search process to try to get it to not consider deceptive hypotheses.
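A toy sketch of that kind of objective change (the penalty constant and numbers are made up for illustration): keep the same search over routes, but have the utility function itself charge a cost when Dave has to get up and carry the robot, rather than trying to stop the planner from considering that hypothesis.

```python
HELP_PENALTY = 20.0  # assumed cost we assign to worlds where Dave must help

def utility(route: str, dave_helps_on_B: bool) -> float:
    delivery_time = {"A": 10.0,
                     "B": 4.0 if dave_helps_on_B else float("inf")}[route]
    needs_help = (route == "B")
    return -delivery_time - (HELP_PENALTY if needs_help else 0.0)

def choose_route(dave_helps_on_B: bool) -> str:
    # Same unrestricted search as before; only the objective changed.
    return max(["A", "B"], key=lambda r: utility(r, dave_helps_on_B))

print(choose_route(dave_helps_on_B=True))  # now "A": exploiting Dave no longer pays
```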