Anja, this is a fantastic post. It’s very clear, easy to read, and it made a lot of sense to me (and I have very little background in thinking about this sort of stuff). Thanks for writing it up! I can understand several issues a lot more clearly now, especially how easy (and tempting) it is for an agent that has access to its source code to wirehead itself.
I agree with Alexei, this has just now helped me a lot.
Although I now have to ask a stupid question; please have pity on me, I’m new to the site and I have little knowledge to work from.
What would happen if we set an algorithm inside the AGI assigning negative infinite utility to any action that modifies its own utility function or said algorithm itself?
This would be within reasonable parameters; ideally, it could change its utility function, but only along certain pre-approved paths, so that it could still actually move around.
"Reasonable" here is a magic word, in the sense that it’s a black box which I don’t know how to map out.
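To make the idea concrete, here is a minimal sketch in Python of the kind of guard I am imagining; all the names (`PROTECTED`, `guarded_utility`, the `modifies` tag on actions) are made up for illustration and not any real design:

```python
import math

# Toy sketch of the proposed guard (all names hypothetical).
# Actions tagged as modifying the utility function or the guard itself
# get negative infinite utility; everything else is scored normally.

PROTECTED = {"utility_function", "guard"}  # components the guard protects

def guarded_utility(action, base_utility):
    """Score an action, vetoing anything that touches a protected component."""
    if action.get("modifies", set()) & PROTECTED:
        return -math.inf  # forbidden self-modification
    return base_utility(action)

print(guarded_utility({"modifies": {"utility_function"}}, lambda a: 1.0))  # -inf
print(guarded_utility({"modifies": set()}, lambda a: 1.0))                 # 1.0
```

The pre-approved paths would presumably be a whitelist added to this check, but specifying that whitelist is exactly the black box I mean.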
What would happen if we set an algorithm inside the AGI assigning negative infinite utility to any action that modifies its own utility function or said algorithm itself?
There are several problems with this approach. First of all, how do you specify all the actions that modify the utility function? How likely is it that you can exhaustively specify, in a practical implementation, every sequence of actions that leads to a modification of the utility function? Experience with cryptography has taught us that there is almost always some side-channel attack the original developers did not think of, and that is just in the case of human vs. human intelligence.
Forbidden actions in general seem like a bad idea with an AGI that is smarter than us; see, for example, the AI Box experiment.
Then there is the problem that we actually don’t want any part of the AGI to be unmodifiable. The agent might revise its model of how the universe works (as we did when we went from Newtonian physics to quantum mechanics), and then it has to modify its utility function to fit the new model, or it is left with gibberish.
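As a toy illustration (with hypothetical names), a utility function written against one world model can simply stop making sense over states described by the revised model:

```python
# Utility defined over states described in the old model's vocabulary.
def utility_newtonian(state):
    return -abs(state["particle_position"] - 3.0)

old_state = {"particle_position": 2.5}
print(utility_newtonian(old_state))  # works: -0.5

# After an ontology shift, states use different concepts, and the old
# utility function no longer even applies to them.
new_state = {"wavefunction_amplitudes": [0.6, 0.8]}
try:
    utility_newtonian(new_state)
except KeyError:
    print("the old utility function is gibberish over the new ontology")
```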
All that said, I think what you described corresponds to the hack evolution has used on us: we have acquired a list of things (or schemas) that would mess up our utility functions and reduce our agency, and those things just feel icky to us, like the experience machine or electrical stimulation of the brain. But we don’t have the luxury evolution had of learning by making lots and lots of mistakes.
I think your intuition is basically right. An AGI will have to change its utility function; the question is basically how and why. For FAI, we want to make sure that all future modifications preserve the “friendly” aspect, which is very difficult to ensure (we don’t have the necessary math for that right now).
What would happen if we set an algorithm inside the AGI assigning negative infinite utility to any action that modifies its own utility function or said algorithm itself?
An argument that is fairly accepted here is that even this is not necessary. If Gandhi could take a pill that would make him okay with murdering people, he wouldn’t do it because this would lead to him murdering people, something he doesn’t want now. (See http://lesswrong.com/lw/2vj/gandhi_murder_pills_and_mental_illness/)
Similarly, if we can tie an AI’s utility function to the actual state of the world, and not just to how it perceives the world, then it wouldn’t modify its utility function: even though its potential future self would think it has more utility, its present self identifies that future as having less utility.
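In toy terms (a sketch only, with made-up names): the agent scores a candidate utility function by how much utility its predicted consequences have under the current function, not under the candidate:

```python
# Gandhi-pill logic (hypothetical names): accept a self-modification only if
# the outcomes the agent predicts it would lead to score well under the
# *current* utility function, not under the candidate one.

def accepts_modification(current_utility, candidate_utility, predict_outcomes):
    outcomes_if_unchanged = predict_outcomes(current_utility)
    outcomes_if_modified = predict_outcomes(candidate_utility)
    return (sum(map(current_utility, outcomes_if_modified)) >
            sum(map(current_utility, outcomes_if_unchanged)))

# Gandhi and the murder pill: the pill's outcomes score fine under the
# pill-taker's values, but badly under Gandhi's current values, so he refuses.
gandhi  = lambda outcome: -10.0 if outcome == "murder" else 1.0
pill    = lambda outcome: 1.0   # indifferent to murder
predict = lambda u: ["murder"] if u("murder") >= 0 else ["no murder"]
print(accepts_modification(gandhi, pill, predict))  # False
```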
Does this simplify to the AI obeying: “Modify my utility function if and only if the new version is likely to result in more utility according to the current version?”
If so, something about it feels wrong. For one thing, I’m not sure how an AI following such a rule would ever conclude it should change the function. If it can only make changes that result in maximizing the current function, why not just keep the current one and continue maximizing it?
That’s the point: it would almost never change its underlying utility function. Once we have a provably friendly FAI, we wouldn’t want it to change the part that makes it friendly.
Now, it could still change how it goes about achieving its utility function, as long as that helps it get more utility, so it would still be self-modifying.
There is a chance that it could change (e.g. if you were naturally a two-boxer on Newcomb’s Problem, you might self-modify to become a one-boxer), but those cases are rare.
Thank you.