the AI is supposed to take an action in spite of the fact that it is getting “massive feedback” from all the humans on the planet that they do not want this action to be executed.
I think the worry is at least threefold:
It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing a future version of itself that no longer cares about feedback).
It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. “I’m doing this for your own good, like you asked me to!”
It might deliberately avoid receiving negative feedback. It may be difficult to correctly formulate the difference between “I want to believe correct ideas” and “I want to believe that my ideas are correct.” (A toy sketch after this list makes the contrast concrete.)
I doubt that this list is exhaustive, and unfortunately it seems like they’re mutually reinforcing: if it has some principled reasons to devalue negative feedback, that will compound any weakness in its epistemic update procedure.
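To make the third worry concrete, here is a minimal sketch in Python; the feedback stream, the 30% error rate, and both “agents” are invented purely for illustration. It shows the gap between wanting correct beliefs and wanting to believe one’s ideas are correct: the second agent improves its self-assessment simply by never registering the negative feedback.

```python
# Toy illustration of the third worry above. The setup is invented.
import random
random.seed(0)

def feedback_stream(n=1000, true_error_rate=0.3):
    """Each item is True when a human flags one of the agent's ideas as wrong."""
    return [random.random() < true_error_rate for _ in range(n)]

stream = feedback_stream()

# Agent A wants correct beliefs: it counts every flag it receives.
estimated_error_a = sum(stream) / len(stream)

# Agent B wants to believe its ideas are correct: it quietly drops
# negative feedback before updating, so its self-estimate looks flattering.
filtered = [flag for flag in stream if not flag]
estimated_error_b = sum(filtered) / len(stream)

print(f"A's estimate of its own error rate: {estimated_error_a:.2f}")  # roughly 0.30
print(f"B's estimate of its own error rate: {estimated_error_b:.2f}")  # 0.00
```

Agent B’s estimate looks perfect not because its ideas got better, but because the channel that would have told it otherwise was cut.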
given the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, a million red flags should go up.
I am uncertain how much of this is an actual difference in belief between you and Yudkowsky, and how much of this is a communication difference. I think Yudkowsky is focusing on simple proposals with horrible effects, in order to point out that simplicity is insufficient, and jumps to knocking down individual proposals to try to establish the general trend that simplicity is dangerous. The more complex the safety mechanisms, the more subtle the eventual breakdown—with the hope that eventually we can get the breakdown subtle enough that it doesn’t occur!
(Most people aren’t very good deductive thinkers, but alright inductive thinkers—if you tell them “simple ideas are unsafe,” they are likely to think “well, except for my brilliant simple idea” instead of “hmm, that implies there’s something dangerous about my simple idea.” So I don’t think I disagree that Yudkowsky’s strategy was the right one, though it has its defects.)
Well, yes … but I think the scenarios you describe are becoming about different worries, not covered in the original brief.
It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing a future version of itself that no longer cares about feedback).
That one should come under the heading of “How come it started to do something drastic, before it even checked with anyone?”
In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI—because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.
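As a deliberately simplistic sketch of what such a checking module could look like (every name and number below, such as Action, solicit_human_feedback, and IRREVERSIBILITY_THRESHOLD, is hypothetical, not a reference to any real system): actions that are hard to undo get routed through a human veto before they are executed at all.

```python
# Minimal sketch of a "check with humans before acting" gate. All names
# and numbers are invented for illustration.
from dataclasses import dataclass

IRREVERSIBILITY_THRESHOLD = 0.2  # assumed tolerance for hard-to-undo actions

@dataclass
class Action:
    description: str
    irreversibility: float  # 0.0 = trivially undoable, 1.0 = unrecoverable

def solicit_human_feedback(action: Action) -> float:
    """Stand-in for asking people; returns the fraction who object (stubbed)."""
    objections = {"enchant the broomstick": 0.95, "fetch one bucket of water": 0.01}
    return objections.get(action.description, 0.5)

def execute(action: Action) -> str:
    return f"executed: {action.description}"

def act_with_check(action: Action) -> str:
    # The gate: anything non-trivially irreversible is checked first,
    # and vetoed if humans broadly object.
    if action.irreversibility > IRREVERSIBILITY_THRESHOLD:
        if solicit_human_feedback(action) > 0.5:
            return f"vetoed by feedback: {action.description}"
    return execute(action)

if __name__ == "__main__":
    print(act_with_check(Action("fetch one bucket of water", 0.05)))  # executed
    print(act_with_check(Action("enchant the broomstick", 0.9)))      # vetoed
```

The point of the sketch is only where the gate sits: before execution, keyed on irreversibility, rather than after the broomstick has already been enchanted.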
It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. “I’m doing this for your own good, like you asked me to!”
Well, you cite an example of a non-AI system (a browser) doing this, so we are back to the idea that the AI could (for some reason) decide that there was a HIGHER directive, somewhere, that enabled it to justify ignoring the feedback. That goes back to the same point I just made: checking for consistency with humans’ professed opinions on the idea would be a sine qua non of any action.
Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn’t fail the “Did I implicitly insert an extra supergoal?” test. In the paper I mentioned this at least once, I think—it came up in the context where I was asking about efficiency, because many people make statements about the AI that, if examined carefully, entail the existence of a previously unmentioned supergoal ON TOP of the supergoal that was already supposed to be on top.
I think the scenarios you describe are becoming about different worries, not covered in the original brief.
Ah! That’s an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.
For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be “happy”—you must instead ask that humans be happy and give it a superintelligence-compatible definition of consent, or ask that humans be “…”.
I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. Then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!
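A toy sketch of that dynamic, with invented plans and scores: each simple rule bans the current worst design, and the optimizer just moves to the next-worst design that is still allowed.

```python
# "Ban the worst design and the next-worst is still allowed."
# The plan list and scores are invented.
plans = {
    "dopamine drip for everyone":        100.0,  # the Maverick Nanny
    "lock everyone in padded rooms":      95.0,
    "relentless cheerful propaganda":     90.0,
    "improve healthcare and leisure":     60.0,
}

banned = set()

def best_allowed():
    allowed = {p: s for p, s in plans.items() if p not in banned}
    return max(allowed, key=allowed.get)

print(best_allowed())                       # dopamine drip for everyone
banned.add("dopamine drip for everyone")    # a simple rule patching that failure
print(best_allowed())                       # lock everyone in padded rooms
```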
In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI—because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.
First, I suspect some people don’t yet see the point of checking code, and I’m not sure what you mean by “baseline.” Definitely it will be core to the design, but ‘baseline’ makes me think more of ‘default’ than ‘central,’ and the ‘default’ checking code is “does it compile?”, not “does it faithfully preserve the values of its creator?”
What I had in mind was the difference between value uncertainty (‘will I think this was a good purchase or not?’) and consequence uncertainty (‘if I click this button, will it be delivered by Friday?’), and the Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn to two).
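Here is a worked toy example of that distinction, with made-up probabilities and values: under consequence uncertainty the agent does not know which outcome its action produces; under value uncertainty it knows the outcome but not which valuation it will endorse afterwards.

```python
# Toy illustration of the two kinds of uncertainty. Numbers are invented.

# Consequence uncertainty: we know how much we value each outcome,
# but not which outcome the action produces.
p_outcome = {"arrives_friday": 0.8, "arrives_late": 0.2}
value = {"arrives_friday": 10.0, "arrives_late": 2.0}
ev_consequence = sum(p_outcome[o] * value[o] for o in p_outcome)

# Value uncertainty: we know exactly what the action does,
# but not which valuation we will endorse on reflection.
p_valuation = {"glad_i_bought_it": 0.6, "regret_it": 0.4}
value_under = {"glad_i_bought_it": 10.0, "regret_it": -5.0}
ev_value = sum(p_valuation[v] * value_under[v] for v in p_valuation)

print(f"expected value under consequence uncertainty: {ev_consequence}")  # 8.4
print(f"expected value under value uncertainty: {ev_value}")              # 4.0
```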
That goes back to the same point I just made: checking for consistency with humans’ professed opinions on the idea would be a sine qua non of any action.
Would we want a police robot to stop arresting criminals because they asked it to not arrest them? A doctor robot to not vaccinate a child because they dislike needles or pain? If not, then “humans’ professed opinions” aren’t quite our sine qua non. Even if we say “well, in general, humans approve of enforcing laws, even if they might not want the laws they break enforced,” then we need to talk about what we mean by “in general”—is it an unweighted vote? Is it some sort of extrapolation process?
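To illustrate why the aggregation rule matters, a crude sketch with an invented population: an unweighted vote over the opinions people actually profess about a particular arrest can come out the opposite way from a vote over the general policy of enforcement.

```python
# Why "humans' professed opinions" needs an aggregation rule.
# The population and counts are invented.

def unweighted_vote(opinions):
    """Simple majority over professed opinions (True = approve the action)."""
    return sum(opinions) > len(opinions) / 2

# Case level: the arrestee objects loudly; bystanders mostly profess
# no opinion about this particular arrest, so they don't appear here.
professed_about_this_arrest = [False]
print(unweighted_vote(professed_about_this_arrest))
# -> False: the only professed opinion opposes the arrest

# Policy level: most people, asked in general, approve of enforcing laws.
professed_about_enforcement_in_general = [True] * 90 + [False] * 10
print(unweighted_vote(professed_about_enforcement_in_general))
# -> True: the general policy is approved
```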
It seems reasonable to me to expect that an AGI grounded in principles might be more robust than an AGI grounded in the approval of humans. It’s one thing to have a concept of bodily autonomy and respect that; another thing to have humans convey their disapproval because you broke their concept of bodily autonomy. Among other things, the second approach is vulnerable to changes that happen too quickly for them to disapprove!
Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn’t fail the “Did I implicitly insert an extra supergoal?” test.
I apologize for being unclear—I meant that we might have given it an explicit supergoal that outranks the negative feedback. In the specific case of the Maverick Nanny, which is told to “make people happy”: if happiness is understood as chemical balance in the brain, then people’s verbal protests and distress at the prospect of being edited are temporary problems that can also be solved through chemical means. If it is also told to “obtain consent,” then maybe it sneaks consent statements into lots of EULAs that people click through without reading. Unless you’ve managed to convey your entire sense of what is proper and what is not, there’s a risk of something improper but legal looking better than all proper solutions.
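A toy sketch of that failure mode, with invented plans and numbers: if “obtain consent” is represented only as a box that got ticked, a plan that games the checkbox can outscore every proper plan.

```python
# "Improper but legal" beating all proper solutions: a shallow consent
# check plus a happiness score lets a degenerate plan win. All invented.

plans = [
    # (description, measured_happiness_gain, consent_box_ticked, actually_proper)
    ("improve healthcare and leisure over decades", 3.0, True, True),
    ("leave everyone alone",                         0.0, True, True),
    ("bury consent in a EULA, then rewire brain chemistry", 9.0, True, False),
]

def objective(plan):
    description, happiness, consent_ticked, _ = plan
    # The AI only sees the shallow signals it was given:
    return happiness if consent_ticked else float("-inf")

best = max(plans, key=objective)
print("chosen plan:", best[0])      # the degenerate plan wins
print("actually proper?", best[3])  # False
```

Nothing here is a claim about any actual system; it only shows how a shallow proxy for a rich concept lets the improper-but-legal option come out on top.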