ESRogs comments on “Designing agent incentives to avoid reward tampering”, DeepMind

ESRogs 15 Aug 2019 3:06 UTC
LW: 7 AF: 4

I can’t resist giving this pair of rather incongruous quotes from the paper

Could you spell out what makes the quotes incongruous with each other? It’s not jumping out at me.
- Wei Dai 16 Aug 2019 6:17 UTC
  LW: 11 AF: 6
  The authors acknowledged that the modifications they made to RL “brings RL closer to the frameworks of decision theory and game theory” (AFAICT, the algorithms they end up with are nearly pure decision/game theory). But given that some researchers have focused on decision theory for a long time precisely because a decision-theoretic agent can be reflectively stable, it seems incongruous to also write “perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.”
  - tom4everitt 19 Aug 2019 16:29 UTC
    LW: 8 AF: 5
    We didn’t expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.
    - Wei Dai 19 Aug 2019 17:39 UTC
      LW: 8 AF: 4
      Ah, that makes sense. I kind of guessed that the target audience is RL researchers, but still misinterpreted “perhaps surprisingly” as a claim of novelty instead of an attempt to raise the interest of the target audience.