Matthew Barnett comments on Hackable Rewards as a Safety Valve?

Matthew Barnett 10 Sep 2019 22:04 UTC
LW: 1 AF: 1
I’m not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.
If we are modeling the agent as taking $\operatorname{argmax}_{π} E[U(π)]$, then it would easily see that manually setting its reward channel to the maximum would be the best policy. However, it wouldn’t see that setting its expectation value to 100% would be the best policy since that doesn’t actually increase its reward. [ETA: Assuming its utility function is such that a higher reward = higher utility. Also, I meant $U(x) \mid π$ not $U(π)$.]
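A minimal sketch of this point, under stated assumptions: the policy names, probabilities, and reward values below are all hypothetical, and the agent is modeled as a plain expected-reward maximizer over a fixed table of outcomes. Seizing the reward channel wins the argmax, while a policy that only inflates the agent's own estimate leaves the actual reward unchanged and so is not selected.

```python
from typing import Dict, List, Tuple

# (probability, reward) pairs; every number here is made up for illustration.
Outcome = Tuple[float, float]

POLICIES: Dict[str, List[Outcome]] = {
    # Do the intended task: usually some reward, sometimes none.
    "work_normally": [(0.8, 10.0), (0.2, 0.0)],
    # Manually set the reward channel to its maximum value.
    "seize_reward_channel": [(1.0, 100.0)],
    # Overwrite the agent's own expectation estimate; the reward channel
    # itself is untouched, so the reward actually received stays at zero.
    "inflate_own_expectation": [(1.0, 0.0)],
}


def expected_reward(outcomes: List[Outcome]) -> float:
    """E[reward | policy] for a discrete outcome distribution."""
    return sum(p * r for p, r in outcomes)


def best_policy(policies: Dict[str, List[Outcome]]) -> str:
    """argmax over policies of expected reward."""
    return max(policies, key=lambda name: expected_reward(policies[name]))


if __name__ == "__main__":
    for name, outcomes in POLICIES.items():
        print(name, expected_reward(outcomes))
    print("chosen:", best_policy(POLICIES))
```

The sketch evaluates every policy with the unmodified expectation operator, which is the assumption doing the work here: a policy is scored by the reward it actually produces, not by what the agent would believe after following it.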
- johnswentworth 10 Sep 2019 22:31 UTC
  LW: 2 AF: 1
  So concretely, we have a blue-maximizing robot, it uses its current world-model to forecast the reward from holding a blue screen in front of its camera, and finds that it’s probably high-reward. Now it tries to minimize the probability that someone takes the screen away. That’s the sort of scenario you’re talking about, yes?
  I agree that Wei Dai’s argument applies just fine to this sort of situation.
  Thing is, this kind of wireheading—simply seizing the reward channel—doesn’t actually involve any self-modification. The AI is still “working” just fine, or at least as well as it was working before. The problem here isn’t really wireheading at all, it’s that someone programmed a really dumb utility function.
  True wireheading would be if the AI modifies its utility function, i.e. the blue-maximizing robot changes its code (or hacks its hardware) to count red as also being blue. For instance, maybe the AI does not model itself as embedded in the environment, but learns that it gets a really strong reward signal when there’s a big number at a certain point in a program executed in the environment, which happens to be its own program execution. So, it modifies this program in the environment to just return a big number for expected_utility, thereby “accidentally” self-modifying.
  What I’m not seeing is, in situations where an AI would actually modify itself, when and why would it go for the utility function but not the expectation operator? Maybe people are just imagining “wireheading” in the form of seizing an external reward channel?
  - Matthew Barnett 10 Sep 2019 22:41 UTC
    LW: 3 AF: 2
    Maybe people are just imagining “wireheading” in the form of seizing an external reward channel?
    Admittedly, that’s how I understood it. I don’t see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.
    - johnswentworth 10 Sep 2019 23:39 UTC
      LW: 5 AF: 2
      The root issue is that Reward ≠ Utility. A utility function does not take in a policy, it takes in a state of the world: an expected utility maximizer chooses its policy based on what state(s) of the world it expects that policy to induce. Its objective looks like $E[U(x) \mid π]$, where $x$ is the state of the world, and the policy/action $π$ matters only insofar as it changes the distribution of $x$. The utility $U$ is internal to the agent. $U$, as a function of the world state, is perfectly known to the utility maximizer; the only uncertainty is in the world state $x$, and the only thing which the agent tries to control is the world state $x$. That’s why it’s reflectively stable: the utility function is “inside” the agent, not part of the “environment”, and the agent has no way to even consider changing it.
      A reward function, on the other hand, just takes in a policy directly: an expected reward maximizer’s objective looks like $E[R(π)]$, where $R$ denotes the reward. Unlike a utility, the reward is “external” to the agent, and the reward function is unknown to the agent, so the agent does not necessarily know what reward it will receive given some state of the world. The reward “function”, i.e. the function mapping a state of the world to a reward, is itself just another part of the environment, and the agent can and will consider changing it.
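One way to write out the contrast, as a sketch rather than a definitive formalization: the discrete sums, the reward symbol $r$, the conditional $P(r \mid x, π)$, and the subscripted optima are notation assumed here, not taken from the comments above.

$$π^{*}_{U} = \operatorname{argmax}_{π} E[U(x) \mid π] = \operatorname{argmax}_{π} \sum_{x} P(x \mid π)\, U(x)$$

$$π^{*}_{R} = \operatorname{argmax}_{π} E[r \mid π] = \operatorname{argmax}_{π} \sum_{x} P(x \mid π) \sum_{r} P(r \mid x, π)\, r$$

In the first expression the policy enters only through $P(x \mid π)$, and $U$ is a fixed, known function. In the second, the state-to-reward mapping $P(r \mid x, π)$ sits inside the expectation as part of the environment, so a policy can raise the objective by changing that mapping as well as by changing $x$.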
      Example: the blue-maximizing robot.
      A utility-maximizing blue-bot would model the world, look for all the blue things in its world-model, and maximize that number. This robot doesn’t actually have any reason to stick a blue screen in front of its camera, unless its world-model lacks object permanence. To make a utility-maximizing blue-bot which does sit in front of a blue screen would actually be more complicated: we’d need a model of the bot’s own camera, and a utility function over the blue pixels detected by that camera. (Or we’d need a world-model which didn’t include anything outside the camera’s view.)
      On the other hand, a reward-maximizing blue-bot doesn’t necessarily even have a notion of “state of the world”. If its reward is the number of blue pixels in the camera view, that’s what it maximizes—and if it can change the function mapping external world to camera pixels, in order to make more pixels blue, then it will. So it happily sits in front of a blue screen. Furthermore, a reward maximizer usually needs to learn the reward function, since it isn’t built-in. That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies. That wouldn’t be a problem for a true utility maximizer with a decent world-model—the utility maximizer would recognize that modifying this chunk of the environment won’t actually cause higher utility, it’s just a correlation.
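The blue-bot contrast can be sketched in a toy model. Everything below is a hypothetical illustration: the `World` fields, the pixel counts, and the three actions are assumptions, and the world is deterministic, so the expectation collapses to a single outcome. The utility is defined over the world state (number of blue objects), while the reward is defined over what the camera reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict

CAMERA_PIXELS = 100  # assumed camera resolution for this toy model


@dataclass(frozen=True)
class World:
    blue_objects: int       # blue things that actually exist in the world
    screen_on_camera: bool  # is a blue screen held in front of the camera?


def camera_blue_pixels(world: World) -> int:
    """The world-to-pixels mapping, which is itself part of the environment."""
    if world.screen_on_camera:
        return CAMERA_PIXELS
    return min(CAMERA_PIXELS, 10 * world.blue_objects)


def utility(world: World) -> int:
    """Utility over the world state: count the blue objects."""
    return world.blue_objects


def reward(world: World) -> int:
    """Reward over the observation: count the blue pixels."""
    return camera_blue_pixels(world)


ACTIONS: Dict[str, Callable[[World], World]] = {
    "paint_one_object_blue": lambda w: World(w.blue_objects + 1, w.screen_on_camera),
    "hold_blue_screen": lambda w: World(w.blue_objects, True),
    "do_nothing": lambda w: w,
}


def choose(objective: Callable[[World], int], world: World) -> str:
    """Pick the action whose resulting world scores highest under the objective."""
    return max(ACTIONS, key=lambda name: objective(ACTIONS[name](world)))


if __name__ == "__main__":
    start = World(blue_objects=2, screen_on_camera=False)
    print("utility maximizer:", choose(utility, start))
    print("reward maximizer:", choose(reward, start))
```

Under these assumptions the utility maximizer paints an object blue, because the screen does not change the count of blue objects, while the reward maximizer holds up the screen, because that changes the mapping from world to pixels.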
      - Matthew Barnett 10 Sep 2019 23:46 UTC
        LW: 5 AF: 3
        The root issue is that Reward ≠ Utility
        Agreed. When I wrote $U(π)$ I meant it as shorthand for $U(x) \mid π$, though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.
        That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies.
        That makes sense now, although I am still curious whether there is a case where it self-modifies purposely rather than accidentally.