Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization, to see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.
I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights.
I am less wedded to “think about what shards will form and make sure they don’t care about bad stuff (like reward)”, because I think we won’t get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being “tool AI.”
The “RL ‘agents’ will maximize reward”/”The point of RL is to select for high reward” mistake is still made frequently and prominently. Yoshua Bengio (a Turing award winner!) recently gave a talk at an alignment workshop. Here’s one of his slides:
During the Q&A, I pushed back, and he was incredulous that I disagreed. We chatted after his talk. I also sent him this article, and he disagreed with that as well. Bengio influences AI policy quite a bit, so I find this especially worrying. I do not want RL training methods to be dismissed or seen as suspect because of e.g. contingent terminological choices like "reward" or "agents."
(Also, in my experience, if I don’t speak up and call out these claims, no one does.)
I think there may have been a communication error. It sounded to me like you were making the point that the policy does not have to internalize the reward function, while he was making the point that the training setup does attempt to find a policy that maximizes, as far as it can tell, the reward function. In other words, he was saying that reward is the optimization target of RL training, and you were saying that reward is not the optimization target of policy inference. Maybe.
I'm pretty sure he was talking about trained policies maximizing reward, by default, outside the historical training distribution. He was making these claims very strongly and confidently, and in the very next slide he cited Cohen's "Advanced artificial agents intervene in the provision of reward." That work advocates a very strong version of "policies will maximize some kind of reward because that's the point of RL."
He later appeared to clarify or back down from these claims, but in a way which seemed inconsistent with his slides, so I was pretty confused about his overall stance. His presentation, though, leaned hard on "RL trains reward maximizers."
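To make the distinction in this exchange concrete, here is a minimal sketch of my own (a hypothetical two-armed bandit with a REINFORCE-style update; none of the names come from the talk or the post). Reward appears only as a weight on the training update; the trained policy's forward pass at inference time contains no reference to reward at all, which is why "reward is the training process's optimization target" does not imply "the policy maximizes reward."

```python
# Minimal sketch (numpy-only, hypothetical two-armed bandit): reward shows up
# in the *training* update, but inference is a pure forward pass with no
# reward term anywhere.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)  # parameters of a 2-action softmax policy

def act():
    """Inference: sample an action from the policy. No reward appears here."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(2, p=probs), probs

def reward_fn(action):
    """Hypothetical environment: action 1 pays off more on average."""
    return float(rng.normal(loc=action, scale=0.1))

# Training: REINFORCE-style update. Reward scales the gradient -- it is the
# optimization target of this training process -- but nothing forces the
# learned policy to represent or pursue "reward" as an internal objective.
lr = 0.5
for _ in range(500):
    action, probs = act()
    r = reward_fn(action)
    grad_logp = -probs
    grad_logp[action] += 1.0      # gradient of log pi(action) w.r.t. logits
    logits += lr * r * grad_logp  # reward enters only as an update weight

print("learned action probabilities:", act()[1])
```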
There's also a problem where a bunch of people appear to have cached that e.g. "inner alignment failures" can happen (whatever the heck that's supposed to mean), but other parts of their beliefs clearly have not incorporated this post's main point. So if you say "hey, you seem to be making this mistake", they can point to some other part of their beliefs and go "but I don't believe that in general!".