Actually, while I did recheck the Reward is Enough paper, it turns out I misunderstood part of it in a way that wasn't obvious to me even on rereading, which makes the paper much less egregious. I am updating toward the view that you are correct and that I am not spending enough effort on favorably interpreting existing discourse.
I still disagree with parts of that essay and still think Sutton & co don’t understand the key points. I still think you underestimate how much people don’t get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).
Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won’t get embedded as a terminal goal, but the idea that it needs to be “magically spawned” is very strawmanny.
Agreed on both counts for your first sentence.
The “and” in “reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts” is doing important work; “magically” is meant to apply to the conjunction of the two clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added “reinforce those reward-focused thoughts into terminal values.” Would that have been clearer? (I have also gone ahead and replaced “magically” with “automatically.”)
Hmm, perhaps clearer to say “reward does not automatically reinforce reward-focused thoughts into terminal values”, given that we both agree that agents will have thoughts about reward either way.
But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this distinction feels pretty fuzzy; in humans, for example, I think it is not that clear-cut.
In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it’s also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.
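For concreteness, here is a minimal sketch of the update rule I take us both to be assuming (a toy REINFORCE-style example, purely illustrative; the setup and names are hypothetical): the reward scales the gradient of log π(a|s), so the update strengthens whatever computation raised the probability of the rewarded action, with nothing in the update singling out reward-focused computation, terminal or otherwise.

```python
# Toy REINFORCE-style update (hypothetical setup, just for illustration).
# Reward scales the gradient of log pi(a|s): the update strengthens
# whatever computation raised the probability of the rewarded action,
# whether or not that computation "was about" reward.
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)   # toy policy: 4-dim state -> 3 actions
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(4)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = 1.0                     # scalar reward from the environment

# Loss is -reward * log pi(a|s). Credit assignment fixes the update
# direction (which parameters contributed); reward only sets the scale.
loss = -reward * dist.log_prob(action)
opt.zero_grad()
loss.backward()
opt.step()
```

The sketch makes the shape of the disagreement visible: the update rule specifies which computations get strengthened and by how much, but it is silent on whether any reinforced reward-modeling circuitry ends up as a terminal or an instrumental value, which is exactly the part we're debating.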