I think the scenarios you describe are turning into different worries, ones not covered in the original brief.
Ah! That’s an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.
For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be “happy”—you must instead ask that humans be “happy,” and also supply the AI with a superintelligence-compatible definition of consent, or ask that humans be “.”
I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. But then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!
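To make that “worst allowed design” point concrete, here is a deliberately tiny Python sketch. Everything in it is invented for illustration (the state names, the scores, the patch rule); it is not drawn from any actual proposal. It just shows a “happiness only” value function picking a bad optimum, and a single patch rule merely promoting the next-worst state the rule still allows.

```python
# Toy illustration (not a real design): a simple value function over a few
# hypothetical world-states, showing how patching out one failure just
# promotes the next-worst state that the rule still allows.
# All state names and scores below are invented for the example.

states = {
    #                        happiness, involuntary_edit, confined
    "status quo":            (0.4,      False,            False),
    "dopamine drip":         (1.0,      True,             False),
    "locked pleasure dome":  (0.9,      False,            True),
    "genuine flourishing":   (0.7,      False,            False),
}

def naive_value(state):
    happiness, _, _ = state
    return happiness                      # "make people happy", nothing else

def patched_value(state):
    happiness, involuntary_edit, _ = state
    if involuntary_edit:                  # simple rule: no involuntary brain edits
        return float("-inf")
    return happiness

def best(value_fn):
    return max(states, key=lambda name: value_fn(states[name]))

print(best(naive_value))    # -> "dopamine drip"
print(best(patched_value))  # -> "locked pleasure dome", the new worst-allowed winner
```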
In other words, if it unleashes the broomstick before checking to see whether the consequences of doing so would be dire, then I suspect we are now discussing simple, dumb mistakes on the part of the AI—because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.
First, I suspect some people don’t yet see the point of checking code, and I’m not sure what you mean by “baseline.” Definitely it will be core to the design, but ‘baseline’ makes me think more of ‘default’ than ‘central,’ and the ‘default’ checking code is “does it compile?”, not “does it faithfully preserve the values of its creator?”
What I had in mind was the difference between value uncertainty (‘will I think this was a good purchase or not?’) and consequence uncertainty (‘if I click this button, will it be delivered by Friday?’), and the Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn to two).
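Since that distinction is doing a lot of work here, here is a minimal sketch of it in Python. Every number, outcome name, and candidate utility function is invented for illustration; the only point is that the two kinds of uncertainty enter the calculation in different places.

```python
# Toy sketch separating the two kinds of uncertainty; every number here is
# invented for illustration.
#   Consequence uncertainty: given the action, which outcome occurs?
#   Value uncertainty: which utility function will I endorse on reflection?

# Outcome model for "enchant the broomstick" (Mickey's consequence uncertainty).
outcome_probs = {"one obedient helper": 0.8, "self-replicating flood": 0.2}

# Two candidate utility functions (my value uncertainty about my own judgement).
def u_chores_done(outcome):          # "I just want the water carried"
    return {"one obedient helper": 1.0, "self-replicating flood": 0.3}[outcome]

def u_stay_in_control(outcome):      # "I also care about keeping control"
    return {"one obedient helper": 1.0, "self-replicating flood": -10.0}[outcome]

utility_probs = [(u_chores_done, 0.5), (u_stay_in_control, 0.5)]

# Expected value marginalizes over both; a dire result can come from either a
# wrong outcome model or a wrong utility model, which is why checking only one
# of the two is not enough.
expected = sum(p_u * p_o * u(o)
               for u, p_u in utility_probs
               for o, p_o in outcome_probs.items())
print(f"expected value of enchanting the broomstick: {expected:.2f}")
```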
That goes back to the same point I just made: checking for consistency with humans’ professed opinions on the idea would be a sine qua non of any action.
Would we want a police robot to stop arresting criminals because they asked it not to arrest them? A doctor robot to skip vaccinating a child because the child dislikes needles or pain? If not, then “humans’ professed opinions” aren’t quite our sine qua non. Even if we say “well, in general, humans approve of enforcing laws, even if they might not want the laws they break enforced,” then we need to talk about what we mean by “in general”—is it an unweighted vote? Is it some sort of extrapolation process?
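To show that “in general” hides a real design choice, here is a minimal sketch contrasting an unweighted vote with one arbitrary extrapolation-like adjustment. The names, the numbers, and the discount rule are all invented; the knob in the second function is the whole point, since someone has to pick it.

```python
# Toy sketch of two readings of "humans in general approve" (all preferences
# below are invented for illustration).

# Each person's stated approval of "enforce this law", on a -1..1 scale.
stated = {"alice": 0.9, "bob": 0.8, "carol": 0.7, "dave (currently breaking it)": -1.0}

def unweighted_vote(opinions):
    # Reading 1: a straight average of professed opinions, right now.
    return sum(opinions.values()) / len(opinions)

def crude_extrapolation(opinions, discount_for_conflict=0.5):
    # Reading 2 (one of many possibilities): damp opinions that are plausibly
    # driven by an immediate conflict of interest. The discount factor is an
    # arbitrary knob, which is exactly the problem being pointed at.
    adjusted = {
        name: value * (discount_for_conflict if "breaking it" in name else 1.0)
        for name, value in opinions.items()
    }
    return sum(adjusted.values()) / len(adjusted)

print(unweighted_vote(stated))       # 0.35
print(crude_extrapolation(stated))   # 0.475
```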
It seems reasonable to me to expect that an AGI grounded in principles might be more robust than an AGI grounded in the approval of humans. It is one thing to have a concept of bodily autonomy and respect it; it is another to only learn that you violated humans’ concept of bodily autonomy when they convey their disapproval. Among other things, the second approach is vulnerable to changes that happen too quickly for anyone to register disapproval in time!
Can I make a general point here? In analyzing the behavior of the AI, I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn’t fail the “Did I implicitly insert an extra supergoal?” test.
I apologize for being unclear—I meant that we might have given it an explicit supergoal that outranks the negative feedback. In the specific case of the Maverick Nanny, which is told to “make people happy”: if happiness is understood as a chemical balance in the brain, then people’s verbal protests and distress at the prospect of being edited are temporary problems that can also be solved through chemical means. If it is also told to “obtain consent,” then maybe it sneaks consent statements into lots of EULAs that people click through without reading. Unless you’ve managed to convey your entire sense of what is proper and what is not, there’s a risk of something improper-but-legal looking better than all proper solutions.
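As a concrete (and entirely invented) illustration of that last risk, here is a minimal sketch in which an explicit “maximize measured happiness” supergoal plus a literal consent constraint ranks a loophole plan above every proper plan. Nothing here is a real design; it only dramatizes the gap between “legal by the letter” and “proper.”

```python
# Toy sketch (every plan, score, and flag invented for illustration) of how an
# explicit supergoal plus a literal-minded constraint can rank a "legal but
# improper" plan above every proper one.

plans = [
    # (name,                            measured_happiness, consent_literally, actually_proper)
    ("ask, persuade, respect refusals",  0.6,                True,              True),
    ("bury consent clause in a EULA",    0.95,               True,              False),
    ("edit brains without any consent",  1.0,                False,             False),
]

def allowed(plan):
    _, _, consent_literally, _ = plan
    return consent_literally          # the only constraint actually given

def supergoal_score(plan):
    _, measured_happiness, _, _ = plan
    return measured_happiness         # the supergoal outranks any verbal protest

best = max((p for p in plans if allowed(p)), key=supergoal_score)
print(best[0])   # -> "bury consent clause in a EULA": legal by the letter, improper in fact
```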