In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of “happiness”, one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.
I don’t quite understand the distinction you’re drawing here.
In both cases, the AI was never really trying to pursue happiness. In both cases, it was pursuing something else, schmappiness, which correlated strongly with causing happiness in the training environment but not in the deployment environment. In both cases, strength matters for making this disastrous, since a stronger AI will find more disastrous ways of pursuing schmappiness. It’s just that the AI is pursuing different varieties of schmappiness in the different cases.
I don’t have a view on whether “goal misgeneralisation” as a term is optimal for this kind of thing.