There are decision theories that just try to do the right thing without needing to modify themselves. One obvious example is the decision rule “do the thing I would have self-modified to choose if I could have.” So even in situations like the Twin Prisoners’ Dilemma, you won’t necessarily have an incentive to self-modify.
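The Twin Prisoners' Dilemma point can be made concrete with a minimal sketch (the payoff numbers are hypothetical, chosen only to be standard PD payoffs): an agent that evaluates actions knowing its exact copy will mirror them already chooses cooperation directly, without any need to self-modify first.

```python
# Twin Prisoners' Dilemma sketch with hypothetical standard payoffs.
# Because the opponent is an exact copy, it will mirror whatever action
# the agent takes, so only the diagonal outcomes (C,C) and (D,D) are
# actually reachable.

# PAYOFF[(my_action, twin_action)] -> my payoff
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def best_action_vs_twin():
    """Pick the action that scores best given the twin mirrors it."""
    return max(["C", "D"], key=lambda a: PAYOFF[(a, a)])

print(best_action_vs_twin())  # -> C
```

The rule here compares only the mirrored outcomes, which is one way to cash out "do the thing I would have self-modified to choose": a precommitted agent would have locked in cooperation, and this agent reaches the same answer on the spot.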
But if there are situations whose outcomes depend on the AI's source code itself, and not just on the decisions it would make, then yes, there can be incentives for self-modification. There are also incentives for hacking the computer you're running on, or for lying to the human to get what you want. Which of these wins out depends on the details, and doesn't seem amenable to a mathematical proof.