I’m pretty sure he was talking about the trained policies maximizing reward, by default, outside the historical training distribution. He was making these claims very strongly and confidently, and on the very next slide cited Cohen’s “Advanced artificial agents intervene in the provision of reward.” That work advocates a very strong version of “policies will maximize some kind of reward, because that’s the point of RL.”
He later appeared to clarify/back down from these claims, but in a way which seemed inconsistent with his slides, so I was pretty confused about his overall stance. His presentation, though, was going strong on “RL trains reward maximizers.”
There’s also a problem where a bunch of people appear to have cached that e.g. “inner alignment failures” can happen (whatever the heck that’s supposed to mean), but other parts of their beliefs seem to obviously not have incorporated this post’s main point. So if you say “hey you seem to be making this mistake”, they can point to some other part of their beliefs and go “but I don’t believe that in general!”.