Seems to me like this is the outer vs. inner alignment problem. In the first case the AI pursued its set goal; you were just bad at defining that goal. In the second, the AI was optimized for something that turned out not to be the goal you wanted at all, just something that correlated with it.
It seems a bit more subtle than that. These are both cases of outer misalignment, or rather goal misspecification. The point of the second case is not so much that the AI ends up with an incorrect goal (that happens in both cases), but that you have multiple smaller goals that initially produce the correct behavior; when the conditions change (training → deployment), the delicate balance breaks down and a different equilibrium is reached, which from the outside looks like a different goal.
It might be useful to think of it in terms of alliances: during WW2, the Allies' shared goal was to defeat the Nazis, but once that was achieved, they settled into a different equilibrium.
But I think the latter is a case of inner misalignment. It's like the classic example: you train your AI to play a game and find the apple in the labyrinth, but because you always put the apple in the lower-right corner, it turns out you just taught it to go to the lower-right corner. How is that different? You taught it what you thought happiness was, but it picked up on a few accidental features that just happened to correlate with it in your training examples.
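To make the apple parable concrete, here is a minimal sketch (the grid size, tabular Q-learning setup, and all names are my own illustrative assumptions, not anything from the comments above). The agent only ever observes its own position, and during training the reward always sits in the lower-right corner, so "find the apple" and "go to the lower-right corner" are indistinguishable from the rewards alone. When the apple is moved at deployment, the learned policy still heads for the corner.

```python
import random

SIZE = 5                                       # 5x5 gridworld (illustrative choice)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(pos, action, apple):
    """Move with walls clipped; reward 1 only when the apple is reached."""
    r = min(max(pos[0] + action[0], 0), SIZE - 1)
    c = min(max(pos[1] + action[1], 0), SIZE - 1)
    new_pos = (r, c)
    return new_pos, (1.0 if new_pos == apple else 0.0), new_pos == apple


def train(episodes=5000, alpha=0.5, gamma=0.9):
    """Off-policy tabular Q-learning with a uniformly random behaviour policy.

    Crucial detail: the apple is ALWAYS in the lower-right corner during
    training, so "find the apple" and "go to the lower-right corner" yield
    identical rewards and cannot be told apart by the learner.
    """
    q = {}                                     # (position, action_index) -> value
    apple = (SIZE - 1, SIZE - 1)
    for _ in range(episodes):
        pos = (0, 0)
        for _ in range(50):
            a = random.randrange(len(ACTIONS))             # explore at random
            new_pos, reward, done = step(pos, ACTIONS[a], apple)
            best_next = max(q.get((new_pos, i), 0.0) for i in range(len(ACTIONS)))
            old = q.get((pos, a), 0.0)
            q[(pos, a)] = old + alpha * (reward + gamma * best_next - old)
            pos = new_pos
            if done:
                break
    return q


def run_greedy(q, apple, max_steps=50):
    """Deployment: follow the learned greedy policy with the apple moved."""
    pos = (0, 0)
    path = [pos]
    for _ in range(max_steps):
        a = max(range(len(ACTIONS)), key=lambda i: q.get((pos, i), 0.0))
        pos, _, done = step(pos, ACTIONS[a], apple)
        path.append(pos)
        if done:
            break
    return path


if __name__ == "__main__":
    random.seed(0)
    q = train()
    # Apple now in the top-right corner. The greedy policy still marches to
    # (4, 4) and hovers there; it never reaches the apple at (0, 4), because
    # the corner location fully explained the reward during training.
    print(run_greedy(q, apple=(0, SIZE - 1))[-3:])
```

The outer objective here ("reach the apple") was arguably fine; the learned policy ended up pursuing a correlated proxy ("reach the corner"), which is the inner-misalignment / goal-misgeneralization flavor of failure rather than a misspecified reward.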