LessWrong has a [trove of thought experiments](https://www.lesswrong.com/posts/PcfHSSAMNFMgdqFyB/can-you-control-the-past) about scenarios where, arguably, the best way to maximize your utility is to verifiably (with some probability) modify your own utility function. These start with the prisoner’s dilemma and extend to games in which a superintelligence predicts what you will do and puts money in boxes accordingly.
These thought experiments seem to have real-world counterparts. For example, voting is pretty much irrational under causal decision theory (CDT), since a single vote almost never changes the result, yet the outcomes of elections correlate with the utility functions of the people who vote. Likewise, people who grow up in high-trust societies do better than people who grow up in low-trust societies, even though defecting is individually rational.
In addition, humans have an astonishing capacity for modifying their own utility functions, for example by joining religions or by gaining or losing empathy for animals.
Is it plausible that we could analytically prove that, in a training environment rich in these sorts of scenarios, an AGI that wants to maximize an initially bad utility function would develop the capability to verifiably (with some probability) modify its own utility function, as people do, in order to survive and be released into the world?
There are decision theories that just try to do the right thing without needing to modify themselves. One obvious example is the decision rule “do the thing I would have self-modified to choose if I could have.” So even in situations like the Twin Prisoners’ Dilemma, you won’t necessarily have an incentive to self-modify.
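To make the Twin Prisoners’ Dilemma point concrete, here is a minimal sketch in Python. The payoff numbers and the function names (`cdt_choice`, `policy_choice`) are illustrative assumptions of mine, not taken from the linked post. The idea is only to show that an agent which reasons "my twin outputs whatever I output" already cooperates, so it has nothing to gain from rewriting itself, whereas an agent that treats the twin's move as fixed defects no matter what it believes.

```python
# Hypothetical payoffs for "me": (my_action, twin_action) -> my payoff.
# Standard prisoner's dilemma ordering; the exact numbers are an assumption.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def cdt_choice(p_twin_cooperates: float) -> str:
    """Best-respond to a fixed belief about the twin's action.

    The twin's move is treated as causally independent of mine.
    """
    def expected(action: str) -> float:
        return (p_twin_cooperates * PAYOFF[(action, "C")]
                + (1 - p_twin_cooperates) * PAYOFF[(action, "D")])
    return max(ACTIONS, key=expected)


def policy_choice() -> str:
    """Pick the action that is best given the twin outputs the same action.

    This stands in for "do the thing I would have self-modified to choose".
    """
    return max(ACTIONS, key=lambda action: PAYOFF[(action, action)])


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(f"CDT-style choice with belief {p}: {cdt_choice(p)}")
    a = policy_choice()
    print(f"Policy-level choice: {a}, payoff vs. twin: {PAYOFF[(a, a)]}")
```

Under these assumed payoffs, two copies of the first agent end up at mutual defection, while two copies of the second end up at mutual cooperation, without either one modifying itself.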
But if there are situations that depend on the AI’s source code, and not just on what decisions it would make, then yes, there can be incentives for self-modification. There are also incentives, though, for hacking the computer you’re running on, or for figuring out how to lie to the human to get what you want. Which of these wins out depends on the details, and doesn’t seem amenable to a mathematical proof.