This is the whole point of goal misgeneralization. They have experiments (albeit on toy environments, where the results can be explained by the network finding the wrong algorithm), so I’d say it’s quite plausible.
I guess the answer is yes then! (I think I now remember seeing a video about that.)