This paper describes a kind of “partially embedded” agent: one that explicitly models its reward/utility function (but not other parts of itself) as belonging to the environment and subject to modification, whether by itself or by the environment. It shows that an agent which uses its current utility function to decide what to do has no incentive to modify that utility function, and that if it properly models what would happen were the utility function changed by the environment, it will also want to protect it. The paper seems to spend a lot of pages on a relatively simple/intuitive idea that has been discussed on LW in various forms for at least a decade, but maybe this kind of detailed formal treatment will be useful for making some people (ML researchers?) take AI safety ideas more seriously?
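To make that point concrete, here is a minimal toy sketch (mine, not the paper’s formalism; the world, numbers, and names are all illustrative assumptions): a deterministic three-step setting where a “tamper” action overwrites the reward function with one that returns 100 everywhere, evaluated two ways.

```python
# Toy sketch (not from the paper): why evaluating futures with the *current*
# reward function removes the incentive to tamper with that function.
# All states, numbers, and names below are illustrative assumptions.

R_CURRENT = {"work": 1.0, "tamper": 0.0}     # the agent's current reward function
R_HACKED = {"work": 100.0, "tamper": 100.0}  # reward function after tampering


def rollout(action, horizon=3):
    """Deterministic rollout: the chosen action repeats, and tampering
    swaps in the hacked reward function for every subsequent step."""
    active_rf = R_HACKED if action == "tamper" else R_CURRENT
    return [(action, active_rf) for _ in range(horizon)]


def value_naive(action):
    """Score each step with whatever reward function is active at that step."""
    return sum(rf[state] for state, rf in rollout(action))


def value_current_rf(action):
    """Score every step with the agent's *current* reward function."""
    return sum(R_CURRENT[state] for state, _ in rollout(action))


for evaluate in (value_naive, value_current_rf):
    best = max(["work", "tamper"], key=evaluate)
    print(f"{evaluate.__name__}: work={evaluate('work'):.0f}, "
          f"tamper={evaluate('tamper'):.0f} -> chooses {best!r}")
```

The naive evaluation, which scores each future step with whatever reward function will be active at that step, prefers tampering; the evaluation that sticks with the current reward function sees nothing to gain from tampering, which is the stability result described above.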
I can’t resist giving this pair of rather incongruous quotes from the paper:
Fortunately and perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.
A known reward function also brings RL closer to the frameworks of decision theory and game theory (Osborne, 2003; Steele and Stefansson, 2016), where agents are usually aware of their own reward (or utility) function, and are only uncertain about the outcomes for different actions or policies.
Could you spell out what makes the quotes incongruous with each other? It’s not jumping out at me.
The authors acknowledged that their modification of RL “brings RL closer to the frameworks of decision theory and game theory” (AFAICT, the algorithms they end up with are nearly pure decision/game theory). But given that some researchers have focused on decision theory for a long time exactly because a decision-theoretic agent can be reflectively stable, it seems incongruous to also write “perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.”
We didn’t expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.
Ah, that makes sense. I kind of guessed that the target audience was RL researchers, but still misinterpreted “perhaps surprisingly” as a claim of novelty rather than as an attempt to raise that audience’s interest.