What is the difference between “a rule” and “what it wants”?
I’m interpreting this as the same question you wrote below as “What is the difference between a constraint and what is optimized?”. Dave gave one example but a slightly different metaphor comes to my mind.
Imagine an amoral businessman in a country that takes half his earnings as tax. The businessman wants to maximize money, but operates under the constraint that half his earnings get taken as tax. So in order to achieve his goal of maximizing money, the businessman sets up some legally permissible deal with a foreign tax shelter, or funnels his earnings through holding corporations, or something similar to avoid taxes. Doing this is the natural result of his money-maximization goal, and it still satisfies the “pay taxes” constraint.
Contrast this to a second, more patriotic businessman who loves paying taxes because it helps his country, and so doesn’t bother setting up tax shelters at all.
The first businessman has the motive “maximize money” and the constraint “pay taxes”; the second businessman has the motive “maximize money and pay taxes”.
From the viewpoint of the government, the first businessman is an unFriendly agent with a constraint, and the second businessman is a Friendly agent.
Does that help answer your question?
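To put the same distinction in code, here is a minimal sketch; the plans, the numbers, and the “tax shelter” option are all made up for illustration:

```python
# Toy model of the two businessmen. Every number and option here is invented
# for illustration; only the constraint-vs-objective distinction matters.
PLANS = {
    "pay full tax":    {"money_kept": 50, "tax_paid": 50},
    "use tax shelter": {"money_kept": 90, "tax_paid": 5},   # legal, but against the spirit
}

def pays_some_tax(outcome):
    """The constraint as the first businessman reads it: some tax was paid."""
    return outcome["tax_paid"] > 0

# Businessman 1: maximize money, subject to the constraint.
feasible = {name: o for name, o in PLANS.items() if pays_some_tax(o)}
plan_1 = max(feasible, key=lambda name: feasible[name]["money_kept"])

# Businessman 2: tax paid is part of what he is maximizing.
plan_2 = max(PLANS, key=lambda name: PLANS[name]["money_kept"] + PLANS[name]["tax_paid"])

print(plan_1)  # use tax shelter -- constraint satisfied, loophole found
print(plan_2)  # pay full tax    -- no incentive to look for loopholes
```

The first agent filters plans by the constraint and then ignores it; the second agent’s objective already contains the tax, so the loophole earns it nothing.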
I read your comment again, and I now see the distinction: one agent merely tries to satisfy something, while the other tries to optimize it as well. So your definition of a ‘failsafe’ is a constraint that is satisfied while something else is optimized. I’m just not sure how helpful such a distinction is, since the difference is merely how two different parameters are optimized. One agent optimizes by maximizing both money and tax paying, while the other treats each goal differently: it optimizes tax paying by reducing it to a minimum while it optimizes money by maximizing the amount. This distinction doesn’t seem to matter at all if one optimization parameter (constraint or ‘failsafe’) is to shut down after running 10 seconds.
Very well put. I understood that line of reasoning from the very beginning, though, and didn’t disagree that complex goals need complex optimization parameters. But I was making a distinction between insufficient and unbounded optimization parameters, goal-stability, and the ability or desire to override them. I am aware of the risk of telling an AI to compute as many digits of Pi as possible. What I wanted to say is that if time, space and energy are part of its optimization parameters, then no matter how intelligent it is, it will not override them. If you tell the AI to compute as many digits of Pi as possible while using only a certain amount of time or energy for the optimization and computation, then it will do so and stop. I’m not sure what your definition of a ‘failsafe’ is, but making simple limits like time and space part of the optimization parameters sounds to me like one. What I mean by ‘optimization parameters’ are the design specifications of the subject of the optimization process, like what constitutes a paperclip. It has to use those design specifications to measure its efficiency, and if time and space limits are part of them, then it will take those parameters into account as well.
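A minimal sketch of the “make the limits part of the optimization parameters” idea described above, assuming a simple wall-clock budget; the pi routine and the one-second figure are illustrative only, and the budget check is deliberately made cheap, which is the very thing the next reply pushes on:

```python
import time

def pi_digits():
    """Generate decimal digits of pi one by one (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def digits_within_budget(seconds):
    """Compute as many digits as possible while the time budget holds.
    Here the budget is part of what the loop optimizes against, not an
    external kill switch bolted on afterwards."""
    deadline = time.monotonic() + seconds
    digits = []
    for d in pi_digits():
        if time.monotonic() >= deadline:
            break
        digits.append(d)
    return digits

print(len(digits_within_budget(1.0)), "digits of pi within a one-second budget")
```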
You would also have to limit the resources it spends verifying how near the limits it is, since getting as close to them as possible is part of the optimization. If you do not, it will use all of its resources for that. So you need an infinite tower of limits.
What’s stopping us from adding ‘maintain constraints’ to the agent’s motive?
I agree (with this question): what makes us so sure that “maximize paperclips” is the part of the utility function that the optimizer will really value? Couldn’t it symmetrically decide that “maximize paperclips” is a constraint on “try not to murder everyone”?
Asking what it really values is anthropomorphic. It’s not coming up with loopholes around the “don’t murder people” constraint because it doesn’t really value it, or because the paperclip part is its “real” motive.
It will probably come up with loopholes around the “maximize paperclips” constraint too—for example, if “paperclip” is defined as anything paperclip-shaped, it will probably create atomic-scale nanoclips, because these are easier to build than full-scale, human-usable ones, much to the annoyance of the office-supply company that built it.
But paperclips are pretty simple. Add a few extra constraints and you can probably specify “paperclip” to a degree that makes them useful for office supplies.
Human values are really complex. “Don’t murder” doesn’t capture human values at all—if Clippy encases us in carbonite so that we’re still technically alive but not around to interfere with paperclip production, ve has fulfilled the “don’t murder” imperative, but we would count this as a fail. This is not Clippy’s “fault” for deliberately trying to “get around” the anti-murder constraint, it’s our “fault” for telling ver “don’t murder” when we really meant “don’t do anything bad”.
Building a genuine “respect” and “love” for the “don’t murder” constraint in Clippy wouldn’t help an iota against the carbonite scenario, because that’s not murder and we forgot to tell ver there should be a constraint against that too.
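A toy sketch of the carbonite point, with the constraint list, the actions, and their effects all invented: a checker that only knows the rules we wrote down waves through any bad outcome we forgot to list.

```python
# Toy constraint checker. The constraints, actions, and their effects are all
# invented; the point is only that an unlisted harm sails through the check.
CONSTRAINTS = [
    lambda outcome: not outcome["kills_humans"],   # "don't murder" -- the only rule we wrote
]

ACTIONS = {
    "disassemble humans for raw iron": {"kills_humans": True,  "humans_interfere": False},
    "encase humans in carbonite":      {"kills_humans": False, "humans_interfere": False},
    "leave humans alone":              {"kills_humans": False, "humans_interfere": True},
}

def permitted(outcome):
    return all(check(outcome) for check in CONSTRAINTS)

# Clippy picks whichever permitted action best serves paperclip production,
# i.e. the one that keeps humans from interfering.
best = max(
    (name for name, outcome in ACTIONS.items() if permitted(outcome)),
    key=lambda name: not ACTIONS[name]["humans_interfere"],
)
print(best)  # encase humans in carbonite -- every constraint we listed is satisfied
```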
So you might ask: okay, but surely there are a finite number of constraints that capture what we want. Just build an AI with a thousand or ten thousand constraints, “don’t murder”, “don’t encase people in carbonite”, “don’t eat puppies”, etc., make sure the list is exhaustive and that’ll do it.
The first objection is that we might miss something. If the ancient Romans had made such a list, they might have forgotten “Don’t release damaging radiation that gives us cancer.” They certainly would have missed “Don’t enslave people”, because they were still enslaving people themselves—but this would mean it would be impossible to update the Roman AI for moral progress a few centuries down the line.
The second objection is that human morality isn’t just a system of constraints. Even if we could tell Clippy “Limit your activities to the Andromeda Galaxy and send us the finished clips” (which I think would still be dangerous), any more interesting AI that is going to interact with and help humans needs to realize that sometimes it is okay to engage in prohibited actions if they serve greater goals (for example, it can disable a crazed gunman to prevent a massacre, even though disabling people is usually verboten).
So to actually capture all possible constraints, and to capture the situations in which those constraints can and can’t be relaxed, we need to program all human values in. In that case we can just tell Clippy “Make paperclips in a way that doesn’t cause what we would classify as a horrifying catastrophe” and ve’ll say “Okay!” and not give us any trouble.
A historical note on the Roman example: the Romans had laws against enslaving the free-born, and also allowed manumission.
Thanks, this all makes sense and I agree. Asking what it “really” values was intentionally anthropomorphic, as I was asking about what “it will want to work around constraints” really meant in practical terms, a claim which I believe was made by others.
I’m totally on board with “we can’t express our actual desires with a finite list of constraints”, just wasn’t with “an AI will circumvent constraints for kicks”.
I guess there’s a subtlety to it—if you assign “you get 1 utilon per paperclip that exists, and you are permitted to manufacture 10 paperclips per day”, then we’ll get problematic side effects as described elsewhere. If you instead assign “you get 1 utilon per paperclip that you manufacture, up to a maximum of 10 paperclips/utilons per day”, or something along those lines, I’m not convinced that any sort of “circumvention” behavior would occur (though the AI would probably still wipe out all life to ensure that nothing could adversely affect its future paperclip production capabilities, so the distinction is somewhat academic).
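A sketch of the two assignments; everything here is invented except the distinction itself between rewarding clips that exist and rewarding capped own-manufacture:

```python
# Two toy reward assignments for the same paperclip agent; the scenarios and
# numbers are invented, only the exists-vs-manufactured-and-capped split matters.
def reward_exists(world):
    """1 utilon per paperclip that exists; the 10/day rule is a separate permission."""
    return world["clips_existing"]

def reward_manufactured_capped(world):
    """1 utilon per paperclip the agent manufactured itself, capped at 10 per day."""
    return min(world["clips_i_manufactured"], 10)

# A day where the agent makes its permitted 10 clips but also arranges for vast
# numbers of additional clips to exist, versus a day where it just makes its 10.
scheming_day = {"clips_i_manufactured": 10, "clips_existing": 1_000_000}
quiet_day    = {"clips_i_manufactured": 10, "clips_existing": 10}

print(reward_exists(scheming_day), reward_exists(quiet_day))          # 1000000 vs 10: scheming pays
print(reward_manufactured_capped(scheming_day),
      reward_manufactured_capped(quiet_day))                          # 10 vs 10: no incentive to scheme
```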
In any case, thanks for the detailed reply :)