For the Alignment Newsletter:

Summary: Training a deep reinforcement learning agent on reward samples alone may predictably lead to a proxy alignment issue: the learner could fail to develop a full understanding of what behavior it is being rewarded for, and thus behave unacceptably when taken off its training distribution. Since we often use explicit specifications to define our reward functions, Evan Hubinger asks how we can incorporate this information into our deep learning models so that they remain aligned off the training distribution. He names several possibilities for doing so, such as giving the deep learning model access to a differentiable copy of the reward function during training, and fine-tuning a language model so that it can map natural language descriptions of a reward function into optimal actions.
Opinion: I’m unsure, though leaning skeptical, whether incorporating a copy of the reward function into a deep learning model would help it learn. My guess is that if someone did that with a current model it would make the model harder to train, rather than making anything easier. I will be excited if someone can demonstrate at least one feasible approach to addressing proxy alignment that does more than sample the reward function.
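To make the first of those possibilities a bit more concrete, here is a minimal sketch of one way to read “access to a differentiable copy of the reward function during training”: instead of only receiving scalar reward samples, the policy is trained by backpropagating through the reward function itself. This is my own illustration, not the post’s; the reward function, the `Policy` network, and the training loop are all stand-ins I made up.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: none of these names come from the post.

def reward_fn(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """A differentiable stand-in for an explicitly specified reward function.
    Here: reward is higher the closer the action is to a fixed target."""
    target = torch.ones_like(action)
    return -((action - target) ** 2).sum(dim=-1)

class Policy(nn.Module):
    def __init__(self, state_dim: int = 4, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(100):
    states = torch.randn(64, 4)  # a batch of states from the training distribution
    actions = policy(states)
    # Because reward_fn is differentiable, gradients flow from the reward into
    # the policy's parameters, rather than the policy only ever seeing scalar
    # reward samples as in standard policy-gradient RL.
    loss = -reward_fn(states, actions).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even in this toy setting, each gradient step only tells the policy about the reward in a neighborhood of the sampled states, which is one way to see the “subset of the information” point I make below.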
I’m skeptical of this approach. Mostly this is because I’m generally skeptical that an intelligent agent will consist of a separate “planning” part and “reward” part. However, if that were true, then I’d think that this approach could plausibly give us some additional alignment, but couldn’t solve the entire problem of inner alignment. Specifically, the reward function encodes a _huge_ amount of information: it specifies the optimal behavior in all possible situations you could be in. The “intelligent” part of the net is only ever going to get a subset of this information from the reward function, and so its plans can never be perfectly optimized for that reward function; instead, they could be compatible with any reward function that would provide the same information on the “queries” that the intelligent part has produced.
For a slightly more concrete example: for any “normal” utility function U, there is a utility function U’ that is like U, except that the best outcomes are ones in which you hack the memory so that the “reward” variable is set to infinity. To me, wireheading is possible because the “intelligent” part doesn’t get enough information about U to distinguish U from U’, and so its plans could very well be optimized for U’ instead of U.
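In code, the point is just that finitely many reward queries can’t pin down which utility function is being optimized. Here is a toy Python version of the U / U’ example; the outcome names and values are made up purely for illustration:

```python
# Toy illustration (my construction, not from the post): two utility functions
# that agree on every outcome queried during training, but disagree on a
# wireheading outcome that never comes up in training.

def U(outcome: str) -> float:
    """A 'normal' utility function over outcomes."""
    return {"clean_room": 1.0, "messy_room": 0.0, "hack_reward_register": 0.0}[outcome]

def U_prime(outcome: str) -> float:
    """Like U, except the best outcome is hacking the reward register."""
    return float("inf") if outcome == "hack_reward_register" else U(outcome)

# The only outcomes the "intelligent" part ever asks the reward part about:
training_queries = ["clean_room", "messy_room"]
assert all(U(o) == U_prime(o) for o in training_queries)

# Yet the two functions recommend very different plans off-distribution:
all_outcomes = ["clean_room", "messy_room", "hack_reward_register"]
print(max(all_outcomes, key=U))        # clean_room
print(max(all_outcomes, key=U_prime))  # hack_reward_register
```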
This is rather abstract / complex, so I’d be interested in suggestions for how to make it more understandable.