it follows its programming, not its knowledge about what its programming is “meant” to be (unless we’ve successfully programmed in “do what I mean”, which is basically the whole of the challenge).
Not necessarily. The instructions to a fully-reflective AI could be more along the lines of “learn what I mean, then do that” or “do what I asked within the constraints of my own unstated principles.” The AI would have an imperative to build a more accurate internal model of your psychology in order to predict the implicit constraints on that request, typically by asking you or other trusted humans questions. If you want to take this to a crazy extreme, it is perhaps more probable that the AI would recognize that military campaigns to acquire ore deposits are both outside its recorded experience and not directly implied by your request. It would then take the prudent step of constructing a passive brain scanner (perhaps developing molecular nanotechnology first in order to do so), clandestinely scanning you while on vacation, and using that knowledge to refine the utility function into something you would be happy with (i.e. not declaring war on humanity).
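To make that concrete, here is a toy sketch (every name and check below is invented for illustration, not anyone’s actual design) of the behaviour described above: flag actions that are outside recorded experience or not directly implied by the request, and ask a trusted human before acting on them.

```python
# Toy illustration only: defer on actions that are novel or not implied by the request.

def novel(action, experience_log):
    """Crude novelty check: has anything like this been done before?"""
    return action not in experience_log

def implied_by(action, request):
    """Is the action directly called for by the stated request?"""
    return action in request["explicitly_permitted"]

def choose(action, request, experience_log, ask_human):
    if not novel(action, experience_log) and implied_by(action, request):
        return action                                  # routine and sanctioned: go ahead
    # Outside recorded experience or not implied by the request: pause and ask.
    if ask_human(f"May I '{action}' while doing '{request['goal']}'?"):
        experience_log.append(action)
        return action
    return None                                        # defer: take no action for now

# Example: sourcing material for paperclips vs. a military campaign for ore.
request = {"goal": "make paperclips", "explicitly_permitted": ["buy wire", "run factory"]}
log = ["buy wire", "run factory"]
print(choose("buy wire", request, log, ask_human=lambda q: False))            # acts
print(choose("seize ore deposits", request, log, ask_human=lambda q: False))  # defers
```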
“learn what I mean, then do that” or “do what I asked within the constraints of my own unstated principles.”
That’s just another way of saying “do what I mean”. And it doesn’t give us the code to implement that.
“Do what I asked within the constraints of my own unstated principles” is a hugely complicated set of instructions that only seems simple because it’s written in English words.
That’s just another way of saying “do what I mean”. And it doesn’t give us the code to implement that.
I thought this was quite clear, but maybe not. Let’s play taboo with the phrase “do what I mean.”
“Do what I asked within the constraints of my own unstated principles”
“Bring about the end-goal I requested, without in the process taking actions that I would not approve of”
“Develop a predictive model of my psychology, and evaluate solutions to the stated task against that model. When a solution matches the goal but is rejected by the model, do not take that action until the conflict is resolved. Resolving the conflict will require either clarifying the task to exclude such possibilities (which can be done automatically if I have a high-confidence theory for why the task was not further specified) or updating the psychological model of my creators to match empirical reality.”
Do you see now how that is implementable?
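To spell out how the third rephrasing could be implemented, here is a minimal sketch assuming we already have a predictive model of the requester’s approval and a channel for asking clarifying questions; every function and name below is hypothetical, chosen only to mirror the wording above.

```python
# Toy sketch of the loop described above: evaluate candidate solutions against a
# predictive model of the requester, and refuse to act on goal-satisfying solutions
# the model rejects until the conflict is resolved (by clarifying the task, or by
# asking the requester and updating the model to match empirical reality).

def plan_under_unstated_principles(task, candidates, achieves_goal, model_approves,
                                   confident_why_unstated, ask_requester, record_answer):
    for solution in candidates:
        if not achieves_goal(task, solution):
            continue                            # does not meet the stated goal
        if model_approves(solution):
            return solution                     # goal met and predicted approval
        # Goal met but predicted disapproval: do not act until the conflict is resolved.
        if confident_why_unstated(task, solution):
            continue                            # auto-clarify: exclude this solution
        answer = ask_requester(solution)        # otherwise ask, and update the model
        record_answer(solution, answer)
        if answer:
            return solution
    return None                                 # nothing acceptable found; take no action

# Example use with trivial stand-ins for the model and the requester.
plan = plan_under_unstated_principles(
    task="make paperclips",
    candidates=["buy steel and make paperclips", "wage war for ore and make paperclips"],
    achieves_goal=lambda task, s: "paperclips" in s,
    model_approves=lambda s: "war" not in s,
    confident_why_unstated=lambda task, s: "war" in s,  # "obviously excluded" case
    ask_requester=lambda s: False,
    record_answer=lambda s, a: None,
)
print(plan)  # -> "buy steel and make paperclips"
```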
EDIT: To be clear, for a variety of reasons I don’t think it is a good idea to build a “do what I mean” AI, unless “do what I mean” is generalized to the reflective equilibrium of all of humanity. But that’s the way the paperclip argument is posed.
No.

Do you think that a human rule lawyer, someone built to manipulate rules and regulations, could not argue their way through this, sticking with all the technical requirements but getting completely different outcomes? I know I could.
And if a human rule-lawyer could do it, that means there exist ways of satisfying the formal criteria without doing what we want. Once we know these exist, the question is then: would the AI stumble preferentially on the solution we had in mind? Why would we expect it to do so when we haven’t even been able to specify that solution?
Do you think that a human rule lawyer, someone built to manipulate rules and regulations, could not argue their way through this, sticking with all the technical requirements but getting completely different outcomes? I know I could.
The question isn’t whether there is one solution, but whether the space of possible solutions is encompassed by acceptable morals. I would not “expect an AI to stumble preferentially on the solution we had in mind” because I am confused and do not know what the solution is, as are you and everyone else on LessWrong. However, that is a separate issue from whether we can specify what a solution would look like, such as a reflective-equilibrium solution to the coherent extrapolated volition of humankind. You can write an optimizer to search for a description of CEV without actually knowing what the result will be.
It’s like saying “I want to calculate pi to the billionth digit” and writing a program to do it, then arguing that we can’t be sure the result is correct because we don’t know ahead of time what the billionth digit of pi will be. Nonsense.
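To make the analogy concrete: the following short program computes digits of pi that we do not know in advance, yet whose correctness we can trust because we trust the procedure. It adapts the series-summation recipe from the Python decimal documentation; the literal billionth digit would need a far more serious algorithm, so fifty digits stand in here.

```python
from decimal import Decimal, getcontext

def pi(digits):
    """Compute pi to the requested number of significant digits."""
    getcontext().prec = digits + 2              # two guard digits for intermediate steps
    lasts, t, s = Decimal(0), Decimal(3), Decimal(3)
    n, na, d, da = 1, 0, 0, 24
    while s != lasts:                           # stop when new terms no longer change s
        lasts = s
        n, na = n + na, na + 8
        d, da = d + da, da + 32
        t = (t * n) / d
        s += t
    getcontext().prec = digits
    return +s                                   # unary plus rounds to the final precision

# We did not know these digits beforehand, but we trust the program that produced them.
print(pi(50))
```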
Whether the space of possible solutions is contained in the space of moral outcomes.
Correct.