If your definition of “aligned” includes “this AI will delight in murdering me as soon as it can do so without getting caught and punished, but currently it can’t do that, so instead it is being helpful” … then I don’t think you are defining the term “aligned” in a reasonable way.
More specifically, if you use the word “aligned” for an AI that wants to murder me as soon as it can get away with it (but it can’t), then that doesn’t leave us with good terminology to discuss how to make an AI that doesn’t want to murder me.
Why not just say “this AI is currently emitting outputs that I like” instead of “this AI is locally aligned”? Are we losing anything that way?