I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I’d call 1) an instance of misspecification and 2) an instance of misgeneralization.
(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I’d have to reread it more carefully to make sure.)
I agree with much of the rest of this post, eg the paragraphs beginning with “The solutions to these two problems are pretty different.”
Here’s our definition in the RL setting for reference (from https://arxiv.org/abs/2105.14111):

A deep RL agent is trained to maximize a reward R: S × A × S → ℝ, where S and A are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time.

**Goal misgeneralization** occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward R′ ≠ R. We call R the **intended objective** and R′ the **behavioral objective** of the agent.
FWIW I think this definition is flawed in many ways (for example, the type signature of the agent’s inner goal is different from that of the reward function, because the agent might have an inner world model that extends beyond the RL environment’s state space; and also it’s generally sketchy to extend the reward function beyond the training distribution), but I don’t know of a different definition that doesn’t have similarly sized flaws.
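To make the quoted definition concrete, here’s a minimal toy sketch in Python (my own illustration, not anything from the paper; the corridor environment, `go_right_policy`, and both reward functions are invented for this example). It shows a policy that keeps acting capably after a distribution shift while scoring low on the intended reward R and high on a behavioral objective R′:

```python
# Toy 1-D corridor, purely illustrative: during "training" the coin always sits
# at the right wall, so a go-right policy scores perfectly on the intended
# reward R. At "test" time the coin moves; the same policy still acts capably,
# but it now looks like it is optimizing a different objective R'.

def intended_reward(next_state, coin_pos):
    """R: pay 1 whenever the agent is on the coin."""
    return 1.0 if next_state == coin_pos else 0.0

def behavioral_objective(next_state, length):
    """R': the objective the trained policy appears to optimize -- being at the right wall."""
    return 1.0 if next_state == length - 1 else 0.0

def go_right_policy(state):
    """Policy learned under the training distribution (coin always at the right wall)."""
    return +1

def rollout(policy, coin_pos, length=11, start=5, horizon=20):
    """Return (total intended reward R, total behavioral objective R') for one episode."""
    state, total_r, total_r_prime = start, 0.0, 0.0
    for _ in range(horizon):
        next_state = max(0, min(length - 1, state + policy(state)))
        total_r += intended_reward(next_state, coin_pos)
        total_r_prime += behavioral_objective(next_state, length)
        state = next_state
    return total_r, total_r_prime

# Training distribution: coin at the right wall, so R and R' coincide.
print(rollout(go_right_policy, coin_pos=10))  # high R, high R'

# Test distribution: coin moved to the left wall; the agent still competently
# reaches the right wall (high R') but the intended reward R is now zero.
print(rollout(go_right_policy, coin_pos=0))   # zero R, high R'
```

The point of the sketch is just that the failure shows up as low R under an environment shift despite competent behavior, rather than as a problem with the reward specification itself, which is what separates it from the misspecification case above.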