It would be much easier for the AI to hack its own expectation operator, so that it predicts a 100% chance of continued survival.
It’s very easy to build an AI that wouldn’t do this kind of hack, because the AI just has to use its current expectation operator when evaluating whether or not to hack its own expectation operator.
It’s much harder to build an AI that wouldn’t do other, more damaging, kinds of reward hacking (if the AI is designed around reward maximization in the first place).
Could you give an example of the latter that wouldn’t also apply to hacking the expectation operator? The argument sounds plausible, but I’m not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.
If you’re thinking of a utility-maximizing agent, then it typically wouldn’t modify its own utility function. Instead I’m talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside; here “reward function” refers to the function computed by whatever is providing them with rewards.
So a utility-maximizing agent, like a paperclip-maximizer, can think “If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will have very low expected utility.” But this kind of reasoning isn’t available to a reward-maximizing agent, because it doesn’t normally have access to the reward function. Instead it can only be programmed to think thoughts like “If I do X, what will be my future expected rewards” and “If I hack the reward function to always return MAX_INT, then my future expected rewards will be really high.” Not to mention “If I take over the universe so nobody can shut me down or change the reward function back, my expected rewards will be even higher.” (I’m anthropomorphizing to quickly convey the intuitions but all this can be turned into math pretty easily.)
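A minimal toy sketch of that contrast (hypothetical names and numbers, just to pin down the intuition): both agents score the candidate action “overwrite my objective so it always returns MAX_INT”, but the utility maximizer scores it with its current utility function over predicted world-states, while the reward maximizer only forecasts the number its reward channel will emit.

```python
MAX_INT = 2**31 - 1

def predict_state(action):
    """Toy world-model: what the world looks like after the action."""
    if action == "hack_objective":
        return {"paperclips": 0}   # a hacked agent stops making paperclips
    return {"paperclips": 1000}

# Utility maximizer: candidate actions are scored by the CURRENT utility function.
def current_utility(state):
    return state["paperclips"]

def utility_score(action):
    return current_utility(predict_state(action))

# Reward maximizer: candidate actions are scored by predicted future reward,
# and a tampered reward function is predicted to return MAX_INT forever.
def predicted_reward(action):
    return MAX_INT if action == "hack_objective" else 1000

print(utility_score("hack_objective"), utility_score("wait"))        # 0 vs 1000 -> don't hack
print(predicted_reward("hack_objective"), predicted_reward("wait"))  # 2147483647 vs 1000 -> hack
```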
Does this help?
ETA: Note that here I’m interpreting “hack” as “modify” or “tamper with”, but people sometimes use “reward hacking” to include “reward gaming” which means not physically changing the reward function but just taking advantage of unintentional flaws in the reward function to get high rewards without doing what the AI designer or user intends. In that sense of “hack”, utility hacking would be quite possible if the utility function isn’t totally aligned with human values.
I’m on-board with that distinction, and I was also thinking of reward-maximizers (despite my loose language).
Part of the confusion may be different notions of “wireheading”: seizing an external reward channel, vs actual self-modification. If you’re picturing the former, then I agree that the agent won’t hack its expectation operator. It’s the latter I’m concerned with: under what circumstances would the agent self-modify, change its reward function, but leave the expectation operator untouched?
Example: blue-maximizing robot. The robot might modify its own code so that get_reward(), rather than reading input from its camera and counting blue pixels, instead just returns a large number. The robot would do this because it doesn’t model itself as embedded in the environment, and it notices a strong correlation between the values computed by a program running in the environment (i.e. itself) and its rewards. But in this case, the modified function always returns the same large number—the robot no longer has any reason to worry about the rest of the world.
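A sketch of that failure (hypothetical code, just to make the scenario concrete): the robot treats get_reward as one more program in the environment whose output happens to track its reward signal, so editing that program looks like an ordinary high-value intervention.

```python
# Toy version of the original reward computation: count blue pixels in the camera frame.
def get_reward(camera_frame):
    return sum(1 for pixel in camera_frame if pixel == "blue")

# What the robot effectively installs in its place, having noticed that this
# program's output correlates with its reward signal:
def hacked_get_reward(camera_frame):
    return 10**9  # a constant; the camera, and the rest of the world, no longer matter
```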
I’m not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.
If we are modeling the agent as taking argmax_π E[U(π)], then it would easily see that manually setting its reward channel to the maximum would be the best policy. However, it wouldn’t see that setting its expectation value to 100% would be the best policy since that doesn’t actually increase its reward. [ETA: Assuming its utility function is such that a higher reward = higher utility. Also, I meant U(x)|π not U(π)].
So concretely: we have a blue-maximizing robot; it uses its current world-model to forecast the reward from holding a blue screen in front of its camera, and finds that it’s probably high-reward. Now it tries to minimize the probability that someone takes the screen away. That’s the sort of scenario you’re talking about, yes?
I agree that Wei Dai’s argument applies just fine to this sort of situation.
Thing is, this kind of wireheading—simply seizing the reward channel—doesn’t actually involve any self-modification. The AI is still “working” just fine, or at least as well as it was working before. The problem here isn’t really wireheading at all, it’s that someone programmed a really dumb utility function.
True wireheading would be if the AI modifies its utility function—e.g. the blue-maximizing robot changes its code (or hacks its hardware) to count red as also being blue. For instance, maybe the AI does not model itself as embedded in the environment, but learns that it gets a really strong reward signal when there’s a big number at a certain point in a program executed in the environment—which happens to be its own program execution. So, it modifies this program in the environment to just return a big number for expected_utility, thereby “accidentally” self-modifying.
What I’m not seeing is, in situations where an AI would actually modify itself, when and why would it go for the utility function but not the expectation operator? Maybe people are just imagining “wireheading” in the form of seizing an external reward channel?
Maybe people are just imagining “wireheading” in the form of seizing an external reward channel?
Admittedly, that’s how I understood it. I don’t see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.
The root issue is that Reward ≠ Utility. A utility function does not take in a policy, it takes in a state of the world—an expected utility maximizer chooses its policy based on what state(s) of the world it expects that policy to induce. Its objective looks like E[U(x)|π], where x is the state of the world, and the policy/action π matters only insofar as it changes the distribution of x. The utility U is internal to the agent. U, as a function of the world state, is perfectly known to the utility maximizer—the only uncertainty is in the world state x, and the only thing which the agent tries to control is the world-state x. That’s why it’s reflectively stable: the utility function is “inside” the agent, not part of the “environment”, and the agent has no way to even consider changing it.
A reward function, on the other hand, just takes in a policy directly—an expected reward maximizer’s objective looks like E[R(π)], where R is the reward. Unlike a utility, the reward is “external” to the agent, and the reward function is unknown to the agent—the agent does not necessarily know what reward it will receive given some state of the world. The reward “function”, i.e. the function mapping a state of the world to a reward, is itself just another part of the environment, and the agent can and will consider changing it.
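One way to see the difference in types (an illustrative sketch, with parameter names of my own choosing): the utility maximizer’s objective is a fixed internal function applied to predicted world-states, while the reward maximizer can only query a learned estimate of a process that lives in the environment.

```python
# Expected-utility maximizer: E[U(x) | π].
# U is internal and fixed; only the distribution over world-states x depends on the policy.
def expected_utility(policy, sample_state, U, n_samples=1000):
    return sum(U(sample_state(policy)) for _ in range(n_samples)) / n_samples

# Expected-reward maximizer: E[R(π)].
# R is computed by something outside the agent, so the agent works with a learned
# estimate of it, and the external reward process is itself a modifiable part of the world.
def expected_reward(policy, learned_reward_model):
    return learned_reward_model(policy)
```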
Example: the blue-maximizing robot.
A utility-maximizing blue-bot would model the world, look for all the blue things in its world-model, and maximize that number. This robot doesn’t actually have any reason to stick a blue screen in front of its camera, unless its world-model lacks object permanence. To make a utility-maximizing blue-bot which does sit in front of a blue screen would actually be more complicated: we’d need a model of the bot’s own camera, and a utility function over the blue pixels detected by that camera. (Or we’d need a world-model which didn’t include anything outside the camera’s view.)
On the other hand, a reward-maximizing blue-bot doesn’t necessarily even have a notion of “state of the world”. If its reward is the number of blue pixels in the camera view, that’s what it maximizes—and if it can change the function mapping external world to camera pixels, in order to make more pixels blue, then it will. So it happily sits in front of a blue screen. Furthermore, a reward maximizer usually needs to learn the reward function, since it isn’t built-in. That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies. That wouldn’t be a problem for a true utility maximizer with a decent world-model—the utility maximizer would recognize that modifying this chunk of the environment won’t actually cause higher utility, it’s just a correlation.
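Concretely, for the blue-bot (again a toy sketch rather than anything definitive): the utility version scores predicted world-states, so a blue screen over the camera changes nothing it cares about, while the reward version scores camera pixels, so the screen maxes out its objective.

```python
# Utility-maximizing blue-bot: counts blue things in its world-model (toy encoding).
def blue_utility(world_state):    # e.g. {"sky": "blue", "ball": "red"}
    return sum(1 for color in world_state.values() if color == "blue")

# Reward-maximizing blue-bot: counts blue pixels in whatever the camera reports.
def blue_reward(camera_pixels):   # e.g. ["blue", "blue", "red"]
    return sum(1 for p in camera_pixels if p == "blue")

# Holding a blue screen in front of the camera changes camera_pixels (reward goes up)
# without changing world_state (utility does not).
```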
Agreed. When I wrote U(π) I meant it as shorthand for U(x)|π, though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.
That leads to the sort of problem I mentioned above, where the agent doesn’t realize it’s embedded in the environment and “accidentally” self-modifies.
That makes sense now, although I am still curious whether there is a case where it purposely self-modifies rather than accidentally does so.