Maybe people are just imagining “wireheading” in the form of seizing an external reward channel?
Admittedly, that’s how I understood it. I don’t see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.
The root issue is that Reward ≠ Utility. A utility function does not take in a policy; it takes in a state of the world—an expected utility maximizer chooses its policy based on what state(s) of the world it expects that policy to induce. Its objective looks like E[U(x)|π], where x is the state of the world, and the policy/action π matters only insofar as it changes the distribution of x. The utility U is internal to the agent. U, as a function of the world state, is perfectly known to the utility maximizer—the only uncertainty is in the world state x, and the only thing which the agent tries to control is the world state x. That’s why it’s reflectively stable: the utility function is “inside” the agent, not part of the “environment”, and the agent has no way to even consider changing it.
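To make the type signature concrete, here is a minimal Python sketch (the world-model, action names, and numbers are all invented for illustration, not any particular agent): U lives inside the agent, and the policy only enters by shifting the distribution over x.

```python
import random

def U(x):
    # Utility over a world state, e.g. how many blue objects exist in the world.
    return x["blue_objects"]

def sample_states(pi, n=1000):
    # Hypothetical world-model: each policy induces a distribution over states x.
    return [{"blue_objects": random.gauss(pi["paint_effort"], 1.0)} for _ in range(n)]

def expected_utility(pi):
    # E[U(x) | pi]: average utility over the states the policy is expected to induce.
    states = sample_states(pi)
    return sum(U(x) for x in states) / len(states)

policies = [{"paint_effort": e} for e in (0, 1, 2, 3)]
best = max(policies, key=expected_utility)  # the agent picks pi, but only x is scored
print(best)  # -> {'paint_effort': 3} (almost surely, given the toy noise)
```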
A reward function, on the other hand, just takes in a policy directly—an expected reward maximizer’s objective looks like E[U(π)]. Unlike a utility, the reward is “external” to the agent, and the reward function is unknown to the agent—the agent does not necessarily know what reward it will receive given some state of the world. The reward “function”, i.e. the function mapping a state of the world to a reward, is itself just another part of the environment, and the agent can and will consider changing it.
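Here is the corresponding sketch for a reward maximizer, in the same made-up toy style: the mapping from world state to reward is just an entry in the environment, so one of the agent's actions can overwrite it.

```python
def rollout(action, env):
    env = dict(env)  # copy so each rollout starts from the same environment
    if action == "tamper":
        # The reward "function" is just another chunk of the environment.
        env["reward_fn"] = lambda state: 10**6
    elif action == "paint":
        env["state"] = {"blue_objects": env["state"]["blue_objects"] + 1}
    return env["reward_fn"](env["state"])

env = {
    "state": {"blue_objects": 3},
    "reward_fn": lambda state: state["blue_objects"],
}

# An expected-reward maximizer compares rewards across actions and happily tampers.
best = max(["paint", "tamper"], key=lambda a: rollout(a, env))
print(best)  # -> "tamper"
```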
Example: the blue-maximizing robot.
A utility-maximizing blue-bot would model the world, look for all the blue things in its world-model, and maximize that number. This robot doesn’t actually have any reason to stick a blue screen in front of its camera, unless its world-model lacks object permanence. To make a utility-maximizing blue-bot which does sit in front of a blue screen would actually be more complicated: we’d need a model of the bot’s own camera, and a utility function over the blue pixels detected by that camera. (Or we’d need a world-model which didn’t include anything outside the camera’s view.)
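A toy version of that in code (object list and action names are just illustrative): the utility is computed over the bot's world-model, which keeps tracking objects even when a screen blocks the camera, so holding up a blue screen buys nothing.

```python
def utility(world_model):
    return sum(1 for obj in world_model if obj == "blue")

def predicted_world(world_model, action):
    if action == "paint_everything_blue":
        return ["blue" for _ in world_model]
    # "hold_blue_screen" only changes camera input; the modeled world is unchanged.
    return list(world_model)

world_model = ["blue", "red", "green"]
actions = ["do_nothing", "hold_blue_screen", "paint_everything_blue"]
best = max(actions, key=lambda a: utility(predicted_world(world_model, a)))
print(best)  # -> "paint_everything_blue"; the screen doesn't raise utility
```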
On the other hand, a reward-maximizing blue-bot doesn’t necessarily even have a notion of “state of the world”. If its reward is the number of blue pixels in the camera view, that’s what it maximizes—and if it can change the function mapping external world to camera pixels, in order to make more pixels blue, then it will. So it happily sits in front of a blue screen. Furthermore, a reward maximizer usually needs to learn the reward function, since it isn’t built-in. That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies. That wouldn’t be a problem for a true utility maximizer with a decent world-model—the utility maximizer would recognize that modifying this chunk of the environment won’t actually cause higher utility, it’s just a correlation.
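And the reward-maximizing counterpart, same toy world: reward is computed from camera pixels, and the function mapping world to pixels is something the bot's actions can change.

```python
def camera(world, screen_in_front):
    if screen_in_front:
        return ["blue"] * 100  # the screen rewrites the world-to-pixels mapping
    return list(world)

def reward(pixels):
    return sum(1 for p in pixels if p == "blue")

world = ["blue", "red", "green"]
actions = {"look_at_world": False, "hold_blue_screen": True}
best = max(actions, key=lambda a: reward(camera(world, actions[a])))
print(best)  # -> "hold_blue_screen"
```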
Agreed. When I wrote U(π) I meant it as shorthand for U(x)|π, though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.
That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies.
That makes sense now, although I am still curious whether there is a case where it purposely self-modifies rather than accidentally does so.