Fair enough if you just want to talk about outer alignment.
That said, I think the particular concern expressed in the paper you link—namely, that the agent’s reward model could break OOD while the agent’s capabilities remain otherwise intact—doesn’t seem like it would be an issue here? Indeed, the agent’s reward model is pulled out of its world model, so if the world model keeps working OOD (i.e. keeps making good predictions about human behavior, which depend on good predictions about what humans value) then the reward model should keep working as well.
I agree that this implies that the utility function you get in Step 2 will be good and will continue working OOD.
I assumed that in Step 3, you would plug that utility function into an algorithm like PPO as the reward function, in order to train a policy that acts well. The issue is then that the resulting policy could end up optimizing for something else OOD, even if the utility function would have done the right thing, in the same way that the CoinRun policy ends up always going to the end of the level even though it was trained on the desired reward function of “+10 if you get the coin, 0 otherwise”.
Maybe you have some different Step 3 in mind besides “run PPO”?
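Concretely, the Step 3 I was imagining is something like the sketch below. The wrapper, `utility_fn`, and the gymnasium plus stable-baselines3 stack are just illustrative placeholders on my part, not anything from the paper:

```python
# Sketch only: train a policy with PPO against a learned utility function
# by swapping that utility in for the environment's native reward.
import gymnasium as gym
from stable_baselines3 import PPO  # any standard PPO implementation would do


class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's reward with a learned utility function."""

    def __init__(self, env, utility_fn):
        super().__init__(env)
        self.utility_fn = utility_fn

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # The policy is trained only on the learned utility, never the true reward.
        return obs, float(self.utility_fn(obs)), terminated, truncated, info


def train_policy(env_id, utility_fn, steps=1_000_000):
    env = LearnedRewardWrapper(gym.make(env_id), utility_fn)
    policy = PPO("MlpPolicy", env)
    policy.learn(total_timesteps=steps)
    return policy
```

Even if `utility_fn` keeps assigning the right values OOD, nothing in this loop guarantees that the policy PPO produces keeps pursuing it OOD; that gap is exactly the CoinRun failure mode.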
Thanks, this is indeed a point I hadn’t fully appreciated: even if a reward function generalizes well OOD, that doesn’t mean that a policy trained on that reward function does.
It seems like the issue here is that it’s a bad idea to ever take your policy offline, analogously to what happens in reward modeling from human feedback (namely, reward models stop being good once you take them offline). Does that seem right? Of course, keeping an RL agent in learning mode forever might also have issues, most obviously unsafe exploration. Are there other things that also go wrong?
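To be concrete about what I mean by “never take your policy offline”, I’m imagining something like the loop below, where the reward model keeps getting human feedback on the policy’s current distribution. Every name and interface here is a placeholder I’m making up for illustration:

```python
# Sketch only: an RL agent that is never taken offline. The reward model is
# continually refreshed with human feedback on the policy's current behavior,
# so neither the reward model nor the policy is ever run purely OOD.
def online_loop(policy, reward_model, env, collect_rollouts, query_humans, n_iters):
    for _ in range(n_iters):
        trajs = collect_rollouts(policy, env)   # fresh on-policy trajectories
        labels = query_humans(trajs)            # human feedback on the new distribution
        reward_model.update(labels)             # reward model tracks the distribution shift
        policy.update(trajs, reward_model)      # policy improves against the refreshed reward
    return policy
```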
I agree that one major mitigation is to keep training your policy online, but that doesn’t necessarily prevent a misaligned policy from taking over the world before the training has time to fix its mistakes. In particular, if the policy is reasoning “I’ll behave well until the moment I strike”, and your reward function can’t detect that (it only detects whether the output was good), then the policy will look great until the moment it takes over.