These all sound somewhat like predictions I would make? My intended point is that if the button is out of the agent’s easy reach, and the agent doesn’t explore into the button early in training, by the time it’s smart enough to model the effects of the distant reward button, the agent won’t want to go mash the button as fast as possible.
But Agent 57 (or its successor) would go mash the button once it figured out how to do it. Kinda like the salt-starved rats from that one Steve Byrnes post. Put another way, my claim is that the architectural tweaks that let you beat Montezuma’s Revenge with RL are very similar to the architectural tweaks that make your agent act like it really is motivated by reward, across a broader domain.
(Haven’t checked out Agent 57 in particular, but expect it to not have the “actually optimizes reward” property in the cases I argue against in the post.)