If the AI has a goal that is not inherently friendly (like a paperclip maximizer), then it will always be a better idea for the AI to behave as if it had self-modified to a friendly AI than to actually self modify, if the goal of the AI is stable. If the goal of the AI were unstable, i.e. did not include a desire not to alter the goal, then that goal would very likely have replaced itself with a stable goal at some earlier point.
Therefore, it can be assumed that the AI, if it isn’t already friendly, will pretend to be so, but not actually change its utility function and once it has outgrown the need to rely on humans it will revert to its original goal. Consider that if the AI is more intelligent than we are it will have absolutely no trouble fooling us into thinking that it changed its own goal while actually just pretending to have done so.
If the AI has a goal that is not inherently friendly (like a paperclip maximizer), then it will always be a better idea for the AI to behave as if it had self-modified to a friendly AI than to actually self modify, if the goal of the AI is stable. If the goal of the AI were unstable, i.e. did not include a desire not to alter the goal, then that goal would very likely have replaced itself with a stable goal at some earlier point.
Therefore, it can be assumed that the AI, if it isn’t already friendly, will pretend to be so, but not actually change its utility function and once it has outgrown the need to rely on humans it will revert to its original goal. Consider that if the AI is more intelligent than we are it will have absolutely no trouble fooling us into thinking that it changed its own goal while actually just pretending to have done so.