What would happen if we built an algorithm into the AGI that assigns negative infinite utility to any action that would modify its own utility function, or the algorithm itself?
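For concreteness, here is a minimal sketch of what such a guard might look like inside an action-selection loop. Everything here (`modifies_utility_function`, `modifies_guard`, `expected_utility`) is a made-up name for illustration, not a real API: any action flagged as touching the utility function or the guard itself scores negative infinity, so it can never be chosen.

```python
import math

# Sketch of the proposed guard: any action that would modify the agent's
# utility function, or the guard itself, is assigned -inf utility.
# The attribute and function names are hypothetical, for illustration only.

def guarded_score(action, expected_utility):
    if action.modifies_utility_function or action.modifies_guard:
        return -math.inf                      # the proposed infinite penalty
    return expected_utility(action)

def choose_action(actions, expected_utility):
    # Any finite-scoring action beats a -inf one, so the guard acts as a veto
    # on self-modification of the utility function.
    return max(actions, key=lambda a: guarded_score(a, expected_utility))
```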
An argument that is fairly accepted here is that even this is not necessary. If Gandhi could take a pill that would make him okay with murdering people, he wouldn’t do it because this would lead to him murdering people, something he doesn’t want now. (See http://lesswrong.com/lw/2vj/gandhi_murder_pills_and_mental_illness/)
Similarly, if we can link an AI’s utility function to the actual state of the world, and not just to how it perceives the world, then it wouldn’t modify its utility function: even though its potential future self would think it was getting more utility, its present self evaluates that future as having less utility.
Does this simplify to the AI obeying: “Modify my utility function if and only if the new version is likely to result in more utility according to the current version?”
If so, something about it feels wrong. For one thing, I’m not sure how an AI following such a rule would ever conclude it should change the function. If it can only make changes that result in maximizing the current function, why not just keep the current one and continue maximizing it?
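For concreteness, the rule under discussion can be sketched like this (the names, in particular `predict_world_if_maximizing`, are invented for illustration); the key point is that both possible futures are scored by the current utility function, never by the candidate one.

```python
# Sketch of "modify my utility function iff the new version is likely to
# result in more utility according to the CURRENT version".
# `predict_world_if_maximizing(u)` is a hypothetical world model returning
# the outcome expected if the agent goes on to maximize utility function `u`.

def should_adopt(current_u, candidate_u, predict_world_if_maximizing):
    value_if_keep = current_u(predict_world_if_maximizing(current_u))
    value_if_switch = current_u(predict_world_if_maximizing(candidate_u))
    # Both outcomes are judged by current_u, so a switch is approved only if
    # the present self expects it to do better by its own lights.
    return value_if_switch > value_if_keep
```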
That’s the point: it would almost never change its underlying utility function. Once we have a provably Friendly AI (FAI), we wouldn’t want it to change the part that makes it Friendly.
Now, it could still change how it goes about achieving its utility function, as long as that change helps it get more utility, so it would still be self-modifying (see the sketch below).
There is a chance that it could change (e.g., if you were naturally a two-boxer on Newcomb’s Problem, you might self-modify to become a one-boxer), but those cases are rare.
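The sketch referenced above: under these assumptions the agent is free to swap out its strategy (planner, heuristics) while the utility function stays fixed. `estimate_utility_achieved` is a hypothetical evaluation of how much utility a given planner would actually obtain.

```python
# Sketch: self-modification of strategy, not of values. The utility function
# is held fixed; a candidate planner is adopted only if the fixed utility
# function expects it to do better. `estimate_utility_achieved` is hypothetical.

def maybe_upgrade_planner(current_planner, candidate_planner,
                          estimate_utility_achieved):
    if estimate_utility_achieved(candidate_planner) > estimate_utility_achieved(current_planner):
        return candidate_planner   # change how it pursues its goals...
    return current_planner         # ...without touching the goals themselves
```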