But I think the latter is a case of inner misalignment. It's like the classic example: "you teach your AI to play a game and find the apple in the labyrinth, but because you always put the apple in the lower-right corner, it turns out you just taught it to go to the lower-right corner." How is this different? You taught it what you thought was happiness, but it picked up on a few accidental features that just happened to correlate with happiness in your training examples.