I agree (with this question) - what makes us so sure that “maximize paperclips” is the part of the utility function that the optimizer will really value? Couldn’t it symmetrically decide that “maximize paperclips” is a constraint on “try not to murder everyone”?
Asking what it really values is anthropomorphic. It’s not coming up with loopholes around the “don’t murder people” constraint because it doesn’t really value it, or because the paperclip part is its “real” motive.
It will probably come up with loopholes around the “maximize paperclips” constraint too—for example, if “paperclip” is defined as anything paperclip-shaped, it will probably create atomic-scale nanoclips, because these are easier to build than full-scale, human-usable ones, much to the annoyance of the office-supply company that built it.
But paperclips are pretty simple. Add a few extra constraints and you can probably specify “paperclip” to a degree that makes them useful for office supplies.
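To make the nanoclip loophole concrete, here's a minimal sketch (the predicates, field names, and thresholds are purely hypothetical, not anyone's actual spec): a definition that only checks shape happily counts an atomic-scale clip, while a definition with a couple of extra constraints on size and material does not.

```python
from dataclasses import dataclass

@dataclass
class Thing:
    shape: str        # e.g. "paperclip-curve"
    length_mm: float  # longest dimension, in millimetres
    material: str     # e.g. "steel"

def is_paperclip_shape_only(t: Thing) -> bool:
    """Shape-only definition: anything paperclip-shaped counts, however tiny."""
    return t.shape == "paperclip-curve"

def is_paperclip_with_extra_constraints(t: Thing) -> bool:
    """A few extra constraints: shape, office-supply size, and a rigid material."""
    return (
        t.shape == "paperclip-curve"
        and 25.0 <= t.length_mm <= 60.0
        and t.material in ("steel", "plastic-coated steel")
    )

nanoclip = Thing(shape="paperclip-curve", length_mm=1e-6, material="steel")
print(is_paperclip_shape_only(nanoclip))              # True  -- counts toward the objective
print(is_paperclip_with_extra_constraints(nanoclip))  # False -- ruled out by the size constraint
```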
Human values are really complex. “Don’t murder” doesn’t capture human values at all—if Clippy encases us in carbonite so that we’re still technically alive but not around to interfere with paperclip production, ve has fulfilled the “don’t murder” imperative, but we would count this as a fail. This is not Clippy’s “fault” for deliberately trying to “get around” the anti-murder constraint, it’s our “fault” for telling ver “don’t murder” when we really meant “don’t do anything bad”.
Building a genuine “respect” and “love” for the “don’t murder” constraint in Clippy wouldn’t help an iota against the carbonite scenario, because that’s not murder and we forgot to tell ver there should be a constraint against that too.
So you might ask: okay, but surely there are a finite number of constraints that capture what we want. Just build an AI with a thousand or ten thousand constraints, “don’t murder”, “don’t encase people in carbonite”, “don’t eat puppies”, etc., make sure the list is exhaustive and that’ll do it.
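As a sketch of what that proposal amounts to (everything here is hypothetical; the point is the shape of the architecture, not the code): the agent filters candidate actions through a fixed list of prohibitions, and anything the list's authors never thought to write down passes straight through.

```python
# A toy "finite list of constraints" architecture. The rule names are hypothetical.
FORBIDDEN = [
    lambda outcome: outcome.get("murders", 0) > 0,
    lambda outcome: outcome.get("people_encased_in_carbonite", 0) > 0,
    lambda outcome: outcome.get("puppies_eaten", 0) > 0,
    # ... ten thousand more rules, each limited by its authors' foresight
]

def permitted(predicted_outcome: dict) -> bool:
    """An action is allowed iff its predicted outcome violates no listed rule."""
    return not any(rule(predicted_outcome) for rule in FORBIDDEN)

# An outcome nobody thought to forbid sails through unchallenged:
print(permitted({"paperclips": 10**20, "humans_enslaved": 7_000_000_000}))  # True
```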
The first objection is that we might miss something. If the ancient Romans had made such a list, they might have forgotten “Don’t release damaging radiation that gives us cancer.” They certainly would have missed “Don’t enslave people”, because they were still enslaving people themselves—which means there would be no way to update the Roman AI for moral progress a few centuries down the line.
The second objection is that human morality isn’t just a system of constraints. Even if we could tell Clippy “Limit your activities to the Andromeda Galaxy and send us the finished clips” (which I think would still be dangerous), any more interesting AI that is going to interact with and help humans needs to realize that sometimes it is okay to engage in prohibited actions if they serve greater goals (for example, it can disable a crazed gunman to prevent a massacre, even though disabling people is usually verboten).
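One way to picture that difference (a toy model; the numbers and names are made up): a hard prohibition can never be overridden, whereas treating the prohibition as a cost lets a sufficiently bad alternative outweigh it.

```python
# Hypothetical toy numbers: a prohibition as a hard rule vs. as a cost that
# a much worse outcome can outweigh.
COST_PER_PERSON_DISABLED = 1
COST_PER_MASSACRE = 1_000_000

def hard_constraint_allows(action: dict) -> bool:
    # "Disabling people is verboten", full stop -- so the gunman can't be stopped.
    return action["people_disabled"] == 0

def weighted_cost(action: dict) -> int:
    # The prohibition becomes a cost, so a greater goal can override it.
    return (action["people_disabled"] * COST_PER_PERSON_DISABLED
            + action["massacres"] * COST_PER_MASSACRE)

do_nothing  = {"people_disabled": 0, "massacres": 1}
stop_gunman = {"people_disabled": 1, "massacres": 0}

print(hard_constraint_allows(stop_gunman))                     # False -- ruled out
print(weighted_cost(stop_gunman) < weighted_cost(do_nothing))  # True  -- preferred
```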
So to actually capture all possible constraints, and to capture the situations in which those constraints can and can’t be relaxed, we need to program all human values in. In that case we can just tell Clippy “Make paperclips in a way that doesn’t cause what we would classify as a horrifying catastrophe” and ve’ll say “Okay!” and not give us any trouble.
A historical note on the enslavement point: the Romans had laws against enslaving the free-born, and they also allowed manumission.
Thanks, this all makes sense and I agree. Asking what it “really” values was intentionally anthropomorphic, as I was asking about what “it will want to work around constraints” really meant in practical terms, a claim which I believe was made by others.
I’m totally on board with “we can’t express our actual desires with a finite list of constraints”; I just wasn’t on board with “an AI will circumvent constraints for kicks”.
I guess there’s a subtlety to it—if you assign “you get 1 utilon per paperclip that exists, and you are permitted to manufacture 10 paperclips per day”, then we’ll get problematic side effects as described elsewhere. If you assign “you get 1 utilon per paperclip that you manufacture, up to a maximum of 10 paperclips/utilons per day” or something along those lines, I’m not convinced that any sort of “circumvention” behavior would occur (though the AI would probably wipe out all life to ensure that nothing could adversely affect its future paperclip production capabilities, so the distinction is somewhat academic).
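A toy formalization of those two assignments (the numbers come from the comment above; the function names and everything else are mine, so treat it as a sketch rather than anyone's actual proposal):

```python
def reward_per_existing_clip(total_clips_in_world: int) -> int:
    """'1 utilon per paperclip that exists': unbounded in the world-count, so the
    agent is pushed to make that count as large as possible by any means, and the
    '10 per day' permission is just an obstacle to route around."""
    return total_clips_in_world

def reward_per_manufactured_clip(clips_made_today: int) -> int:
    """'1 utilon per paperclip you manufacture, capped at 10 per day': the reward
    saturates, so there is nothing to gain from producing (or causing) more."""
    return min(clips_made_today, 10)

print(reward_per_existing_clip(10**15))      # 1000000000000000 -- more is always better
print(reward_per_manufactured_clip(10**15))  # 10 -- capped; no incentive past the quota
```

Even in the capped version, of course, the caveat in the comment stands: protecting future production capacity is still instrumentally valuable, so the cap alone is no safety guarantee.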
What’s stopping us from adding ‘maintain constraints’ to the agent’s motive?
In any case, thanks for the detailed reply :)