The “RL ‘agents’ will maximize reward”/“The point of RL is to select for high reward” mistake is still made frequently and prominently. Yoshua Bengio (a Turing Award winner!) recently gave a talk at an alignment workshop. Here’s one of his slides:
During the Q&A, I pushed back, and he was incredulous that I disagreed. We chatted after his talk. I also sent him this article, and he disagreed with that as well. Bengio influences AI policy quite a bit, so I find this especially worrying. I do not want RL training methods to be dismissed or seen as suspect because of, e.g., contingent terminological choices like “reward” or “agents.”
(Also, in my experience, if I don’t speak up and call out these claims, no one does.)
I think there may have been a communication error. It sounded to me like you were making the point that the policy does not have to internalize the reward function, while he was making the point that the training setup does attempt to find a policy that maximizes, as far as it can tell, the reward function. In other words, he was saying that reward is the optimization target of RL training, while you were saying that reward is not the optimization target of policy inference. Maybe.
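To make that distinction concrete, here is a minimal sketch (my own illustration, not anything from the talk or the post) of a REINFORCE-style bandit learner: reward appears only inside the training update, while the trained policy at inference time is just a function from observations to actions that never computes or consults reward.

```python
# Minimal sketch: reward enters the *training update*, not the trained policy's forward pass.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 1 pays off more often than arm 0.
def pull_arm(arm):
    return float(rng.random() < (0.8 if arm == 1 else 0.2))

# Policy: a single logit; sigmoid(theta) is the probability of choosing arm 1.
theta = 0.0

def act(theta):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return int(rng.random() < p1), p1

# --- Training: REINFORCE update. Reward shows up here, as the quantity
# the update rule pushes the policy toward.
lr = 0.1
for _ in range(2000):
    arm, p1 = act(theta)
    reward = pull_arm(arm)
    # d/dtheta log pi(arm) for a Bernoulli(sigmoid(theta)) policy
    grad_logp = (1.0 - p1) if arm == 1 else -p1
    theta += lr * reward * grad_logp

# --- Inference: the trained policy is just theta plus act().
# No reward signal is computed or consulted anywhere in this call.
chosen_arm, _ = act(theta)
print(f"trained logit: {theta:.2f}, chosen arm: {chosen_arm}")
```

The sketch only shows where reward enters the training machinery; whether the resulting policy also ends up internally representing or pursuing reward is exactly the point under dispute.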
I’m pretty sure he was talking about the trained policies themselves, and claiming that they will, by default, maximize reward outside the historical training distribution. He was making these claims very strongly and confidently, and in the very next slide he cited Cohen’s “Advanced artificial agents intervene in the provision of reward.” That work advocates a very strong version of “policies will maximize some kind of reward because that’s the point of RL.”
He later appeared to clarify/back down from these claims, but in a way which seemed inconsistent with his slides, so I was pretty confused about his overall stance. His presentation, though, was going strong on “RL trains reward maximizers.”
There’s also a problem where a bunch of people appear to have cached that e.g. “inner alignment failures” can happen (whatever the heck that’s supposed to mean), but other parts of their beliefs seem to obviously not have incorporated this post’s main point. So if you say “hey you seem to be making this mistake”, they can point to some other part of their beliefs and go “but I don’t believe that in general!”.