Fair enough if you just want to talk about outer alignment.
That said, I think the particular concern expressed in the paper you link—namely, that the agent’s reward model could break OOD while the agent’s capabilities remain otherwise intact—doesn’t seem like it would be an issue here? Indeed, the agent’s reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
I agree that this implies that the utility function you get in Step 2 will be good and will continue working OOD.
I assumed that in Step 3, you would plug that utility function into an algorithm like PPO as the reward function, in order to train a policy that acts well. The issue is then that the resulting policy could end up optimizing for something else OOD, even if the utility function would have done the right thing, in the same way that the CoinRun policy ends up always going to the end of the level even though it was trained on the desired reward function of “+10 if you get the coin, 0 otherwise”.
Maybe you have some different Step 3 in mind besides “run PPO”?
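Concretely, the Step 3 I was imagining is something like the sketch below. The wrapper, `utility_fn`, and the gymnasium plus stable-baselines3 stack are just illustrative placeholders on my part, not anything from the paper:

```python
# Sketch only: train a policy with PPO against a learned utility function
# by swapping that utility in for the environment's native reward.
import gymnasium as gym
from stable_baselines3 import PPO  # any standard PPO implementation would do


class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's reward with a learned utility function."""

    def __init__(self, env, utility_fn):
        super().__init__(env)
        self.utility_fn = utility_fn

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # The policy is trained only on the learned utility, never the true reward.
        return obs, float(self.utility_fn(obs)), terminated, truncated, info


def train_policy(env_id, utility_fn, steps=1_000_000):
    env = LearnedRewardWrapper(gym.make(env_id), utility_fn)
    policy = PPO("MlpPolicy", env)
    policy.learn(total_timesteps=steps)
    return policy
```

Even if `utility_fn` keeps assigning the right values OOD, nothing in this loop guarantees that the policy PPO produces keeps pursuing it OOD; that gap is exactly the CoinRun failure mode.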
Thanks, this is indeed a point I hadn’t fully appreciated: even if a reward function generalizes well OOD, that doesn’t mean that a policy trained on that reward function does.
It seems like the issue here is that it’s a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
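To be concrete about what I mean by “never take your policy offline”, I’m imagining something like the loop below, where the reward model keeps getting human feedback on the policy’s current distribution. Every name and interface here is a placeholder I’m making up for illustration:

```python
# Sketch only: an RL agent that is never taken offline. The reward model is
# continually refreshed with human feedback on the policy's current behavior,
# so neither the reward model nor the policy is ever run purely OOD.
def online_loop(policy, reward_model, env, collect_rollouts, query_humans, n_iters):
    for _ in range(n_iters):
        trajs = collect_rollouts(policy, env)   # fresh on-policy trajectories
        labels = query_humans(trajs)            # human feedback on the new distribution
        reward_model.update(labels)             # reward model tracks the distribution shift
        policy.update(trajs, reward_model)      # policy improves against the refreshed reward
    return policy
```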
I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.