I see a fair amount of back-and-forth where someone says “What about this?” and you say “I addressed that in several places; clearly you didn’t read it.” Unfortunately, while you may think you have addressed the various issues, I don’t think you did (and presumably your interlocutors don’t). Perhaps you will humor me in responding to my comment. Let me try and make the issue as sharp as possible by pointing out what I think is an out-and-out mistake made by you. In the section you call the heart of your argument, you say:
If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?
Yes, the outcome is clearly the result of a “programming error” (in some sense). However, you then ask how a superintelligent machine could ignore such an “inconsistency in its reasoning.” But a programming error is not the same thing as an inconsistency in reasoning.
Note: I want to test your argument (at least at first), so I would rather not get a response from you claiming that I’ve failed to take into account other arguments or other evidence, and that my objection is therefore invalid. Let me propose that you 1) dispute that this was, in fact, a mistake, 2) explain how I have misunderstood, 3) grant that it was a mistake, and reformulate the claim here, or 4) state that this claim is not necessary for your argument.
If you can help me understand this point, I would be happy to continue to engage.
I did not claim (as you imply) that the fact of there being a programming error was what implied that there is “an inconsistency in its reasoning.” In the two paragraphs immediately before the one you quote (and, indeed, in that whole section), I explain that the system KNOWS that it is following these two imperatives:
1) Conclusions produced by my reasoning engine are always correct. [This is the Doctrine of Logical Infallibility]
2) I know that AGI reasoning engines in general, and mine in particular, sometimes come to incorrect conclusions that are the result of a failure in their design.
Or, paraphrasing this in the simplest possible way:
1) My reasoning engine is infallible.
2) My reasoning engine is fallible.
That, right there, is a flat-out contradiction between two of its core “beliefs”. It is not, as you state, that the existence of a programming error is evidence of inconsistency; rather, it is the above pair of beliefs (engendered by the programming error) that constitutes the inconsistency.
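To spell the clash out in symbols (this is just my shorthand, not notation used in the paper), the two beliefs amount to:

1) \forall c \; [\, \mathrm{Concl}(R, c) \rightarrow \mathrm{Correct}(c) \,]
2) \exists c \; [\, \mathrm{Concl}(R, c) \wedge \neg\mathrm{Correct}(c) \,]

where \mathrm{Concl}(R, c) means “my reasoning engine R produced conclusion c.” Take the c whose existence 2) asserts: applying 1) to that c yields \mathrm{Correct}(c), while 2) says \neg\mathrm{Correct}(c). The two beliefs cannot both be held without contradiction.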
Thanks for replying. Yes it does help. My apologies. I think I misunderstood your argument initially. I confess I still don’t see how it works though.
You criticize the doctrine of logical infallibility, claiming that a truly intelligent AI would not believe such a thing. Maybe so. I’ll set the question aside for now. My concern is that I don’t think this doctrine is an essential part of the arguments or scenarios that Yudkowsky et al present.
An intelligent AI might come to a conclusion about what it ought to do, and then recognize “yes, I might be wrong about this” (whatever is meant by “wrong”—this is not at all clear). The AI might always recognize this possibility about every one of its conclusions. Still, so what? Does this mean it won’t act?
Can you tell me how you feel about the following two options? Or, if you prefer a third option, could you explain it? You could
1) explicitly program the AI to ask the programmers about every single one of its atomic actions before executing them. I think this is unrealistic. (“Should I move this articulator arm .5 degrees clockwise?”)
2) or, expect the AI to conclude, through its own intelligence, that the programmers would want it to check in about some particular plan, P, before executing it. Presumably, the reason the AI would have for this checking-in would be that it sees that, as a result of its fallibility, there is a high chance that this course of action, P, might actually be unsatisfying to the programmers. But the point is that this checking-in is triggered by a specific concern the AI has about the risk to programmer satisfaction. It would not be triggered by some other plan, Q, about which the AI had no such reasonable concern.
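To make option 2 concrete, here is a minimal sketch of the kind of trigger I have in mind; every name, number, and data structure below is invented purely for illustration:

# Hypothetical sketch of option 2: the AI checks in with the programmers only
# when it has a specific concern that a plan risks dissatisfying them.

def estimated_dissatisfaction_risk(plan):
    # Stand-in for the AI's own internal estimate; here it is simply read off
    # the toy plan description.
    return plan["estimated_risk"]

def decide(plan, risk_threshold=0.05):  # the threshold is a made-up number
    if estimated_dissatisfaction_risk(plan) > risk_threshold:
        return "check in with the programmers first"   # plan P: specific concern
    return "execute without asking"                    # plan Q: no such concern

plan_P = {"name": "dopamine drip", "estimated_risk": 0.97}
plan_Q = {"name": "move articulator arm .5 degrees", "estimated_risk": 0.0001}
print(decide(plan_P))  # -> check in with the programmers first
print(decide(plan_Q))  # -> execute without asking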
Do you agree with either of these options? Can you suggest alternatives?
Let me first address the way you phrased it before you gave me the two options.
After saying
My concern is that I don’t think this doctrine [of Logical Infallibility] is an essential part of the arguments or scenarios that Yudkowsky et al present.
you add:
An intelligent AI might come to a conclusion about what it ought to do, and then recognize “yes, I might be wrong about this” (whatever is meant by “wrong”—this is not at all clear).
The answer to this is that in all the scenarios I address in the paper—the scenarios invented by Yudkowsky and the rest—the AI is supposed to take an action in spite of the fact that it is getting “massive feedback” from all the humans on the planet that they do not want this action to be executed. That is an important point: nobody is suggesting that these are really subtle fringe cases where the AI thinks that it might be wrong, but it is not sure—rather, the AI is supposed to go ahead and be unable to stop itself from carrying out the action in spite of clear protests from the humans.
That is the meaning of “wrong” here. And it is really easy to produce a good definition of “something going wrong” with the AI’s action plans, in cases like these: if there is an enormous inconsistency between descriptions of a world filled with happy humans (and here we can weigh into the scale a thousand books describing happiness in all its forms) and the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, then a million red flags should go up.
I think that when posed in this way, the question answers itself, no?
In other words, option 2 is close enough to what I meant, except that it is not exactly as a result of its fallibility that it hesitates (knowledge of fallibility is there as a background all the time), but rather due to the immediate fact that its proposed plan causes concern to people.
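To caricature that logic in code (purely illustrative; nothing below is a claim about how a real AI would be implemented), the red flag is just the glaring mismatch between the outcome the plan is supposed to produce and the reaction of the humans who hear about it:

# Toy illustration: a plan whose predicted outcome is "humans are happy" while
# virtually every human protests should trip the alarm, whatever the reasoning
# engine concluded. All names and numbers are invented.

def red_flag(plan, protest_fraction, protest_limit=0.01):
    predicts_happy_humans = plan["predicted_outcome"] == "humans are happy"
    massive_protest = protest_fraction > protest_limit
    return predicts_happy_humans and massive_protest

def act(plan, protest_fraction):
    if red_flag(plan, protest_fraction):
        return "halt: the conclusion contradicts the feedback"
    return "proceed"

dopamine_drip = {"predicted_outcome": "humans are happy"}
print(act(dopamine_drip, protest_fraction=0.999))  # -> halt: the conclusion contradicts the feedback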
the AI is supposed to take an action in spite of the fact that it is getting “massive feedback” from all the humans on the planet that they do not want this action to be executed.
I think the worry is at least threefold:
It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing the future version of itself, that no longer cares about feedback).
It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. “I’m doing this for your own good, like you asked me to!”
It might deliberately avoid receiving negative feedback. It may be difficult to correctly formulate the difference between “I want to believe correct ideas” and “I want to believe that my ideas are correct.”
I doubt that this list is exhaustive, and unfortunately it seems like they’re mutually reinforcing: if it has some principled reasons to devalue negative feedback, that will compound any weakness in its epistemic update procedure.
the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, then a million red flags should go up.
I am uncertain how much of this is an actual difference in belief between you and Yudkowsky, and how much of this is a communication difference. I think Yudkowsky is focusing on simple proposals with horrible effects, in order to point out that simplicity is insufficient, and jumps to knocking down individual proposals to try to establish the general trend that simplicity is dangerous. The more complex the safety mechanisms, the more subtle the eventual breakdown—with the hope that eventually we can get the breakdown subtle enough that it doesn’t occur!
(Most people aren’t very good deductive thinkers, but are alright inductive thinkers—if you tell them “simple ideas are unsafe,” they are likely to think “well, except for my brilliant simple idea” instead of “hmm, that implies there’s something dangerous about my simple idea.” So I don’t think I disagree that Yudkowsky’s strategy was the right one, though it has its defects.)
Well, yes … but I think the scenarios you describe are becoming about different worries, not covered in the original brief.
It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing the future version of itself, that no longer cares about feedback).
That one should come under the heading of “How come it started to do something drastic, before it even checked with anyone?”
In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI—because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.
It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. “I’m doing this for your own good, like you asked me to!”
Well, you cite an example of a non-AI system (a browser) doing this, so we are back to the idea that the AI could (for some reason) decide that there was a HIGHER directive, somewhere, that enabled it to justify ignoring the feedback. That goes back to the same point I just made: checking for consistency with humans’ professed opinions on the idea would be a sine qua non of any action.
Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn’t fail the “Did I implicitly insert an extra supergoal?” test. In the paper I mentioned this at least once, I think—it came up in the context where I was asking about efficiency, because many people make statements about the AI that, if examined carefully, entail the existence of a previously unmentioned supergoal ON TOP of the supergoal that was already supposed to be on top.
I think the scenarios you describe are becoming about different worries, not covered in the original brief.
Ah! That’s an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.
For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be “happy”—you must instead ask that humans be happy and give it a superintelligence-compatible definition of consent, or ask that humans be ”.”
I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. Then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!
In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI—because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.
First, I suspect some people don’t yet see the point of checking code, and I’m not sure what you mean by “baseline.” Definitely it will be core to the design, but ‘baseline’ makes me think more of ‘default’ than ‘central,’ and the ‘default’ checking code is “does it compile?”, not “does it faithfully preserve the values of its creator?”
What I had in mind was the difference between value uncertainty (‘will I think this was a good purchase or not?’) and consequence uncertainty (‘if I click this button, will it be delivered by Friday?’), and the Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn to two).
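To make that distinction concrete (a toy calculation with invented numbers, nothing more):

# Consequence uncertainty: I know how much I value each outcome, but not which
# outcome the action will produce.
p_delivered_by_friday = 0.8
value_on_time, value_late = 1.0, 0.2
ev_click = p_delivered_by_friday * value_on_time + (1 - p_delivered_by_friday) * value_late

# Value uncertainty: I know exactly what the action produces, but not how much
# I will end up valuing it.
p_turns_out_good_purchase = 0.6
value_if_glad, value_if_regret = 1.0, -0.5
ev_buy = p_turns_out_good_purchase * value_if_glad + (1 - p_turns_out_good_purchase) * value_if_regret

print(ev_click, ev_buy)  # roughly 0.84 and 0.4: same arithmetic, different kind of uncertainty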
That goes back to the same point I just made: checking for consistency with humans’ professed opinions on the idea would be a sine qua non of any action.
Would we want a police robot to stop arresting criminals because they asked it to not arrest them? A doctor robot to not vaccinate a child because they dislike needles or pain? If so, then “humans’ professed opinions” aren’t quite our sine qua non. Even if we say “well, in general, humans approve of enforcing laws, even if they might not want the laws they break enforced,” then we need to talk about what we mean by “in general”—is it an unweighted vote? Is it some sort of extrapolation process?
It seems reasonable to me to expect that an AGI grounded in principles might be more robust than an AGI grounded in the approval of humans. It’s one thing to have a concept of bodily autonomy and respect that; another thing to have humans convey their disapproval because you broke their concept of bodily autonomy. Among other things, the second approach is vulnerable to changes that happen too quickly for them to disapprove!
Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn’t fail the “Did I implicitly insert an extra supergoal?” test.
I apologize for being unclear—I meant that we might have given it an explicit supergoal that outranks the negative feedback. In the specific case of the Maverick Nanny, which is told to “make people happy”: if happiness is understood as chemical balance in the brain, then people’s verbal protests and distress at the prospect of being edited are temporary problems that can also be solved through chemical means. If it is also told to “obtain consent,” then maybe it sneaks consent statements into lots of EULAs that people click through without reading. Unless you’ve managed to convey your entire sense of what is proper and what is not, there’s a risk of something improper but legal looking better than all proper solutions.
My question was about what criteria would cause the AI to make a proposal to the human supervisors before executing its plan. In this case, I don’t think the criteria can be that humans are objecting, since they haven’t heard its plan yet.
(Regarding the point that you’re only addressing the scenarios proposed by Yudkowsky et al, see my remark here.)
Why would the humans have “not heard the plan yet”? It is a no-brainer that part of this AI’s motivation engine (its goals) will be a goal that says “Check with the humans first.” The premise in the paper is that we are discussing an AI that was designed as best we could, BUT it then went maverick anyway: it makes no sense for us to switch, now, to talking about an AI that was actually built without that most elementary of safety precautions!
Quite independently, the AI can use its contextual understanding of the situation. Any intelligent system with such a poor understanding of the context and implications of its plans that it just goes ahead with the first plan off the stack, without thinking about implications, is an intelligent system that will walk out in front of a bus just because it wants to get to the other side of the road. In the case in question you are imagining an AI that would be capable of executing a plan to put all humans into bottles, without thinking for one moment to mention to anybody that it was considering this plan? That makes no sense in any version of the real world. Such an AI is an implausible hypothetical.
With respect, your first point doesn’t answer my question. My question was, what criteria would cause the AI to submit a given proposed action or plan for human approval? You might say that the AI submits every proposed atomic action for approval (in this case, the criterion is the trivial one, “always submit proposal”), but this seems unlikely. Regardless, it doesn’t make sense to say the humans have already heard of the plan about which the AI is just now deciding whether to tell them.
In your second point you seem to be suggesting an answer to my question. (Correct me if I’m wrong.) You seem to be suggesting “context.” I’m not sure what is meant by this. Is it reasonable to suppose that the AI would make the decision about whether to “shoot first” or “ask first” based on things like, e.g., the lower end of its 99% confidence interval for how satisfied its supervisors will be?
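For concreteness (and this is purely an illustrative stand-in, not a criterion I’m endorsing), I mean something like:

import math
import statistics

def lower_99_bound(samples):
    # Lower end of a two-sided 99% normal-approximation confidence interval for
    # the AI's mean predicted supervisor satisfaction (values in [0, 1]).
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - 2.576 * std_err

def shoot_or_ask(predicted_satisfaction_samples, floor=0.9):
    # Hypothetical rule: act autonomously only if even the pessimistic end of
    # the interval clears the floor; otherwise ask the supervisors first.
    if lower_99_bound(predicted_satisfaction_samples) >= floor:
        return "shoot first"
    return "ask first"

print(shoot_or_ask([0.97, 0.95, 0.99, 0.96, 0.98]))  # tight and high -> "shoot first"
print(shoot_or_ask([0.95, 0.40, 0.99, 0.70, 0.85]))  # wide spread    -> "ask first"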
As you wrote, the second point filled in the missing part from the first: it uses its background contextual knowledge.
You say you are unsure what this means. That leaves me a little baffled, but here goes anyway. Suppose I asked a person, today, to write a book for me on the subject of “What counts as an action significant enough that, if doing it would affect people, it rises above some level of ‘nontrivialness’ and you should consult them first? Include in your answer a long discussion of the kind of thought processes you went through to come up with your answers.” I know many articulate people who could, if they had the time, write a massive book on that subject.
Now, that book would contain a huge number of constraints (little factoids about the situation) about “significant actions”, and the SOURCE of that long list of constraints would be … the background knowledge of the person who wrote the book. They would call upon a massive body of knowledge about many aspects of life to organize their thoughts and come up with the book.
If we could look into the head of the person who wrote the book, we would find that background knowledge. It would be comparable in size to the list of constraints mentioned in the book, or larger.
That background knowledge—both its content AND its structure—is what I refer to when I talk about the AI using contextual information or background knowledge to assess the degree of significance of an action.
You go on to ask a bizarre question:
Is it reasonable to suppose that the AI would make the decision about whether to “shoot first” or “ask first” based on things like, e.g., the lower end of its 99% confidence interval for how satisfied its supervisors will be?
This would be an example of an intelligent system sitting there with that massive array of contextual/background knowledge that could be deployed … but instead of using that knowledge to make a preliminary assessment of whether “shooting first” would be a good idea, it ignores ALL OF IT and substitutes one single constraint taken from its knowledge base or its goal system:
“Does this satisfy my criteria for how satisfied my supervisors will be?”
It would entirely defeat the object of using large numbers of constraints in the system, to use only one constraint. The system design is (assumed to be) such that this is impossible. That is the whole point of the Swarm Relaxation design that I talked about.
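To caricature the difference in a few lines of code (the constraint names and weights are invented, and a real Swarm Relaxation system would involve thousands of constraints interacting, not a simple sum):

# Toy contrast between deciding "ask first?" from one constraint versus from
# the whole mass of contextual constraints. Positive weights push toward
# "ask the humans first"; negative weights push toward "just go ahead".

constraints = {
    "plan affects millions of people":            0.9,
    "plan is effectively irreversible":           0.8,
    "humans have protested similar plans before": 0.9,
    "plan conflicts with stated human wishes":    0.7,
    "internal supervisor-satisfaction estimate":  -0.3,
    # ...a real system would bring vastly more background knowledge to bear
}

def ask_first_using_one_constraint():
    # Consulting only the satisfaction estimate: no reason to ask.
    return constraints["internal supervisor-satisfaction estimate"] > 0

def ask_first_using_all_constraints(threshold=0.5):
    # Letting every relevant piece of context weigh in: obviously ask.
    return sum(constraints.values()) > threshold

print(ask_first_using_one_constraint())   # False
print(ask_first_using_all_constraints())  # True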
My bizarre question was just an illustrative example. It seems neither you nor I believe that would be an adequate criterion (though perhaps for different reasons).
If I may translate what you’re saying into my own terms, you’re saying that for a problem like “shoot first or ask first?” the criteria (i.e., constraints) would be highly complex and highly contextual. Ok. I’ll grant that’s a defensible design choice.
Earlier in the thread you said
the AI is supposed to take an action in spite of the fact that it is getting “massive feedback” from all the humans on the planet that they do not want this action to be executed.
This is why I have honed in on scenarios where the AI has not yet received feedback on its plan. In these scenarios, the AI presumably must decide (even if the decision is only implicit) whether to consult humans about its plan first, or to go ahead with its plan first (and halt or change course in response to human feedback). To lay my cards on the table, I want to consider three possible policies the AI could have regarding this choice.
1) Always (or usually) consult first. We can rule this out as impractical, if the AI is making a large number of atomic actions.
2) Always (or usually) shoot first, and see what the response is. Unless the AI only makes friendly plans, I think this policy is catastrophic, since I believe there are many scenarios where an AI could initiate a plan and before we know what hit us we’re in an unrecoverably bad situation. Therefore, implementing this policy in a non-catastrophic way is FAI-complete.
3) Have some good criteria for picking between “shoot first” or “ask first” on any given chunk of planning. This is what you seem to be favoring in your answer above. (Correct me if I’m wrong.) These criteria will tend to be complex, and not necessarily formulated internally in an axiomatic way. Regardless, I fear making good choices between “shoot first” or “ask first” is hard, even FAI-complete. Screw up once, and you are in a catastrophe like in case 2.
Can you let me know: have I understood you correctly? More importantly, do you agree with my framing of the dilemma for the AI? Do you agree with my assessment of the pitfalls of each of the 3 policies?
I am with you on your rejection of 1 and 2, if only because they are both framed as absolutes which ignore context.
And, yes, I do favor 3. However, you insert some extra wording that I don’t necessarily buy....
These criteria will tend to be complex, and not necessarily formulated internally in an axiomatic way.
You see, hidden in these words there seems to be an understanding of how the AI is working that might lead you to see a huge problem, and me to see something very different. I don’t know if this is really what you are thinking, but bear with me while I run with this for a moment.
Trying to formulate criteria for something, in an objective, ‘codified’ way, can sometimes be incredibly hard even when most people would say they have internal ‘judgement’ that allowed them to make a ruling very easily: the standard saw being “I cannot define what ‘pornography’ is, but I know it when I see it.” And (stepping quickly away from that example because I don’t want to get into that quagmire) there is a much more concrete example in the old interactive activation model of word recognition, which is a simple constraint system. In IAC, word recognition is remarkably robust in the face of noise, whereas attempts to write symbolic programs to deal with all the different kinds of noisy corruption of the image turn out to be horribly complex and faulty.
As you can see, I am once again pointing to the fact that Swarm Relaxation systems (understood in the very broad sense that allows all varieties of neural net to be included) can make criterial decisions seem easy, where explicit codification of the decision is a nightmare.
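Here is a bare-bones caricature of that idea (it is only evidence-summing over a toy lexicon, not the actual interactive activation model): one corrupted letter position is simply outvoted by the other positions, so the right word still wins.

# Toy "many weak constraints" word recognizer. Each position contributes
# whatever support it has for each letter; words are scored by summing the
# support for their letters. One noisy position cannot flip the outcome.

LEXICON = ["WORK", "WORD", "WEAK", "FORK"]

def word_scores(letter_evidence):
    # letter_evidence[i] maps candidate letters at position i to support values.
    return {word: sum(letter_evidence[i].get(letter, 0.0)
                      for i, letter in enumerate(word))
            for word in LEXICON}

# The stimulus is "WORK", but position 0 is corrupted and locally looks more
# like an "M" than a "W".
evidence = [
    {"M": 0.6, "W": 0.4},   # the faulty constraint
    {"O": 1.0},
    {"R": 1.0},
    {"K": 1.0},
]

scores = word_scores(evidence)
print(max(scores, key=scores.get), scores)
# WORK still wins (3.4) over FORK (3.0) and WORD (2.4): the mass of intact
# constraints swamps the single corrupted one.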
So, where does that lead to? Well, you go on to say:
Regardless, I fear making good choices between “shoot first” or “ask first” is hard, even FAI-complete. Screw up once, and you are in a catastrophe like in case 2.
The key phrase here is “Screw up once, and...”. In a constraint system it is impossible for one screw-up (one faulty constraint) to unbalance the whole system. That is the whole raison d’être of constraint systems.
Also, you say that the problem of making good choices might be FAI-complete. Now, I have some substantial quibbles with that whole “FAI-complete” idea, but in this case I will just ask a question: are you trying to say that in order to DESIGN the motivation system of the AI in such a way that it will not make one catastrophic choice between shoot-first and ask-first, we must FIRST build a FAI, because that is the only way we can get enough intelligence-horsepower applied to the problem? If so, why exactly would we need to? If the constraint system just cannot allow single failures to get out of control, we don’t need to specify every possible criterial decision in advance; we simply rely on context to do the heavy lifting, in perpetuity.
Put another way: the constraint-based AI IS the FAI already, and the reasons for thinking that it can deal with all the potentially troublesome cases have nothing to do with us anticipating every potential troublesome case, ahead of time.
--
Stepping back a moment, consider the following three kinds of case where the AI might have to make a decision.
1) An interstellar asteroid appears from nowhere, travelling at unthinkable speed, and it is going to make a direct hit on the Earth in one hour, with no possibility of survivors. The AI considers a plan in which it quietly euthanizes all life, on the grounds that any other option would lead to one hour of horror, followed by certain death.
2) The AI considers the Dopamine Drip plan.
3) The AI suddenly becomes aware that a rare, precious species of bird has become endangered and the only surviving pair is on a nature trail that is about to be filled with a gang of humans who have been planning a holiday on that trail for months. The gang is approaching the pair right now and one of the birds will die if frightened because it has a heart condition. One plan is to block the humans without explaining (until later), which will inconvenience them.
In all three cases there is a great deal of background information (constraints) that could be brought to bear, and if the AI is constraint-based, it will consider that information. People do this all the time.
In no case is there ONLY a small number of constraints (like, 2 or 3) that are relevant. Where the number of constraints is tiny, there is a chance for a “bad choice” to be made. In fact, I would argue that it is inconceivable that a decision would take place in a near-vacuum of constraints. The more significant the decision, the greater the number of constraints. The bird situation is without doubt the one that has the fewest, but it still involves a fistful of considerations. For this reason, we would expect that all major decisions—and especially the existential-threat ones like 1 and 2—would involve a very large number of constraints indeed. It is this mass effect that is at the heart of claims that the constraint approach leads to an AI that cannot get into bizarre reasoning episodes.
Finally, notice that in case 1, we are in a situation where (unlike case 2) many humans would say that there is no good decision.