Hmm, perhaps clearer to say “reward does not automatically reinforce reward-focused thoughts into terminal values”, given that we both agree that agents will have thoughts about reward either way.
But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy—e.g. in humans, I think the distinction is not that clear-cut.
In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it’s also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.