The model’s internal representations of aligned behavior need not change very much during this shift; the only difference might be that aligned behavior shifts from being a terminal goal to an instrumental goal.
Relevant post, in case you haven’t seen it: https://www.lesswrong.com/posts/8qCKZj24FJotm3EKd/ultimate-ends-may-be-easily-hidable-behind-convergent
Relevant post, in case you haven’t seen it: https://www.lesswrong.com/posts/8qCKZj24FJotm3EKd/ultimate-ends-may-be-easily-hidable-behind-convergent