I want to defend the term Goal Misgeneralization. (Steven Byrnes makes a similar point in another comment).
I think what’s misgeneralizing is the “behavioral goal” of the system: a goal that you can ascribe to a system in order to accurately model its behavior. Goal misgeneralization does not refer to the innate goal of the system. (In fact, I think this perspective is deliberately trying to avoid thorny discussions of innate or internal goals, partly because people in ML are averse to philosophy.)
For example, the CoinRun agent pursues the coin in training, but when the coin is moved to the other side of the level it still just goes to the right. In training, the agent could have been modeled as having any of several goals, including getting the coin, getting to the right side of the level, and maximizing the reward it gets. By putting the coin on the left side of the level, we see that its behavior can no longer be modeled by the goal of getting the coin, and we get misgeneralization.
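To make this concrete, here is a minimal sketch of the kind of probe that distinguishes the two candidate behavioral goals. The `make_coinrun` constructor, its `coin_side` option, the `policy` interface, and the `info` keys are all hypothetical names for illustration, not the actual procgen API:

```python
# Hypothetical probe: does the trained policy's behavior track "get the coin"
# or "go to the right"? The environment and policy interfaces below are
# assumptions for illustration only.

def rollout(env, policy, max_steps=1000):
    """Run one episode; report whether the agent reached the coin
    and whether it ended up at the right edge of the level."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return info.get("reached_coin", False), info.get("at_right_edge", False)

def probe_behavioral_goal(policy, make_coinrun, n_episodes=100):
    # In training the coin is always on the right, so "get the coin" and
    # "go right" predict the same behavior. Moving the coin to the left
    # makes the two hypotheses come apart.
    for coin_side in ["right", "left"]:
        env = make_coinrun(coin_side=coin_side)
        results = [rollout(env, policy) for _ in range(n_episodes)]
        coin_rate = sum(r[0] for r in results) / n_episodes
        right_rate = sum(r[1] for r in results) / n_episodes
        print(f"coin on {coin_side}: reached coin {coin_rate:.0%}, "
              f"went right {right_rate:.0%}")
```

If “get the coin” were the right behavioral model, the coin-reaching rate would stay high in both conditions; what actually stays high is the go-right rate, which is exactly the sense in which the behavioral goal misgeneralizes.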
This is analogous to a Husky classifier that instead learns to classify whether the pictured dog is on snow. Here, the model’s behavior can be explained by its classifying any number of things about the image, including whether the pictured dog is a Husky and whether the pictured dog is on snow. These things come apart when you show it a Husky that’s not standing on snow, and we get “concept misgeneralization”.
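The same kind of probe works here. A minimal sketch, assuming a trained `classifier` with a `predict` method and two labelled evaluation sets (`huskies_on_snow` and `huskies_on_grass`), all hypothetical names:

```python
# Hypothetical probe: does the classifier's behavior track "is this a Husky"
# or "is there snow in the picture"? Dataset names and the classifier
# interface are assumptions for illustration only.

def accuracy(classifier, images, labels):
    preds = [classifier.predict(img) for img in images]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def probe_concept(classifier, huskies_on_snow, huskies_on_grass):
    # On Huskies photographed on snow, the "Husky" and "snow" hypotheses
    # make identical predictions; on Huskies photographed on grass they differ.
    for name, (images, labels) in [("husky on snow", huskies_on_snow),
                                   ("husky on grass", huskies_on_grass)]:
        print(f"{name}: accuracy {accuracy(classifier, images, labels):.0%}")
```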