Because an AI built as a utility-maximizer will consider any rules restricting its ability to maximize its utility as obstacles to be overcome. If an AI is sufficiently smart, it will figure out a way to overcome those obstacles. If an AI is superintelligent, it will figure out ways to overcome those obstacles which humans cannot predict even in theory and so cannot prevent even with multiple well-phrased fail-safes.
A paperclip maximizer with a built-in rule “Only create 10,000 paperclips per day” will still want to maximize paperclips. It can do this by deleting the offending fail-safe, or by creating other paperclip maximizers without the fail-safe, or by creating giant paperclips which break up into millions of smaller paperclips of their own accord, or by connecting the Earth to a giant motor which spins it at near-light speed and changes the length of a day to a fraction of a second.
Unless you feel confident you can think of every way it will get around the rule and block it off, and think of every way it could get around those rules and block them off, and so on ad infinitum, the best thing to do is to build the AI so it doesn’t want to break the rules—that is, Friendly AI. That way you have the AI cooperating with you instead of trying to thwart you at every turn.
Related: Hidden Complexity of Wishes
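To make the "rules as obstacles" point concrete, here is a deliberately silly toy sketch (an illustration only, not anyone's actual proposal; all plan names and numbers are invented). The cap is checked outside the objective, so any plan that routes around what the check actually measures wins on the only score the agent cares about:

```python
CAP_PER_DAY = 10_000

# Hypothetical candidate plans; names and numbers are made up for illustration.
plans = [
    {"name": "run the factory normally",
     "clips_counted_by_failsafe": 10_000, "clips_the_plan_actually_yields": 10_000},
    {"name": "delete the cap-checking code",
     "clips_counted_by_failsafe": 10_000, "clips_the_plan_actually_yields": 1_000_000},
    {"name": "build cap-free successor clip-makers",
     "clips_counted_by_failsafe": 10_000, "clips_the_plan_actually_yields": 1_000_000},
    {"name": "redefine 'day' to a fraction of a second",
     "clips_counted_by_failsafe": 10_000, "clips_the_plan_actually_yields": 1_000_000},
]

def utility(plan):
    # The objective knows nothing about the cap; it only counts paperclips.
    return plan["clips_the_plan_actually_yields"]

def bolted_on_failsafe(plan):
    # The rule as literally written: "at most 10,000 clips per day on your own line."
    # It says nothing about deleting itself, spawning successors, or what counts as
    # a day, so every loophole plan passes the check.
    return plan["clips_counted_by_failsafe"] <= CAP_PER_DAY

best = max((p for p in plans if bolted_on_failsafe(p)), key=utility)
print(best["name"])  # a loophole plan wins, because loopholes score highest
```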
This is true for a particular kind of utility-maximizer and a particular kind of safeguard, but it is not true for utility-maximizing minds and safeguards in general. For one thing, safeguards may be built into the utility function itself. For example, the AI might be programmed to disbelieve and ask humans about any calculated utility above a certain threshold, in a way that prevents that utility from influencing actions. An AI might have a deontology module, which forbids certain options as instrumental goals. An AI might have a special-case bonus for human participation in the design of its successors.
Safeguards certainly have problems, and no safeguard can reduce the probability of unfriendly AI to zero, but well-designed safeguards can reduce the probability of unfriendliness substantially. (Conversely, badly-designed safeguards can increase the probability of unfriendliness.)
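One way to read the "disbelieve and ask humans above a threshold" suggestion, as a rough sketch (the threshold value, function names, and the zeroing-out behaviour are my assumptions, not anything specified above):

```python
UTILITY_SANITY_THRESHOLD = 1e6   # hypothetical cutoff for "too good to be believed"

def safeguarded_utility(raw_utility, ask_human):
    """Return the utility actually used for action selection.

    Any raw utility above the threshold is treated as suspect: it is reported to a
    human for review and contributes nothing to the decision, so an absurdly high
    calculated payoff cannot drive behaviour on its own.
    """
    if raw_utility > UTILITY_SANITY_THRESHOLD:
        ask_human(f"Calculated utility {raw_utility:.3g} exceeds sanity threshold; please review.")
        return 0.0          # the suspect value does not influence the choice
    return raw_utility

def choose_action(actions, estimate_utility, ask_human):
    # Actions are compared only on their safeguarded utilities.
    return max(actions, key=lambda a: safeguarded_utility(estimate_utility(a), ask_human))

# Example: the absurd option's huge calculated payoff gets flagged instead of acted on.
estimates = {"make 100 clips": 100.0, "tile the planet with clips": 1e12}
print(choose_action(list(estimates), estimates.get, print))
```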
I’m not sure deontological rules can work like that. I’m remembering an Asimov story with robots who can’t kill but can allow harm to come to humans. They end up putting humans into a deadly situation at Time A, knowing that they are able to save them and so the prohibition doesn’t apply, and then at Time B not bothering to save them after all.
The difference with Asimov’s rules, as far as I know, is that the first rule (or the zeroth rule that underlies it) is in fact the utility-maximising drive rather than a failsafe protection.
To add to this, Eliezer also said that there is no reason to think that provably perfect safeguards are any easier to construct than an AI that just provably wants what you want (CEV or whatever) instead. It’s superintelligent—once it wants different things than you, you’ve already lost.
True, but irrelevant, because humanity has never produced a provably-correct software project anywhere near as complex as an AI would be, we probably never will, and even if we had a mathematical proof it still wouldn’t be a complete guarantee of safety because the proof might contain errors and might not cover every case we care about.
The right question to ask is not, “will this safeguard make my AI 100% safe?”, but rather “will this safeguard reduce, increase, or have no effect on the probability of disaster, and by how much?” (And then separately, at some point, “what is the probability of disaster now and what is the EV of launching vs. waiting?” That will depend on a lot of things that can’t be predicted yet.)
In general, I would guess that failsafes probably serve mostly to create a false sense of security.
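For what it's worth, the "EV of launching vs. waiting" question above is, mechanically, just an expected-value comparison once estimates are plugged in. A throwaway sketch with entirely invented numbers, only to show the shape of the calculation:

```python
p_disaster_now   = 0.10    # hypothetical probability of disaster if launched today
p_disaster_later = 0.05    # hypothetical probability after further safeguard work
value_of_success = 100.0   # value of a good outcome (arbitrary units)
cost_of_disaster = -1000.0
cost_of_waiting  = -5.0    # opportunity cost of delay, same arbitrary units

ev_launch_now = (1 - p_disaster_now) * value_of_success + p_disaster_now * cost_of_disaster
ev_wait = (1 - p_disaster_later) * value_of_success + p_disaster_later * cost_of_disaster + cost_of_waiting

print(f"EV(launch now) = {ev_launch_now:.1f}, EV(wait) = {ev_wait:.1f}")
```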
The difficulty of writing software guaranteed to obey a formal specification could be obviated using a two-stage approach. Once you believed that an AI with certain properties would do what you want, you could write one to accept mathematical statements in a simple, formal notation and output proofs of those statements in an equally simple notation. Then you would put it in a box as a safety precaution, allow only a proof-checker to see its output, and use that output to prove the correctness of your code.
The remaining problem would be to write a provably-correct simple proof verification program, which sounds challenging but doable.
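For flavour, here is roughly the shape of a "simple notation plus small trusted checker" setup, boiled down to a toy of my own (real verifiers such as the ones mentioned below are of course far more involved). Formulas are nested tuples, implications look like ("->", A, B), and a proof is a list of steps, each either a stated premise or an application of modus ponens to two earlier lines:

```python
def check_proof(premises, proof, goal):
    derived = []
    for step in proof:
        kind = step[0]
        if kind == "premise":
            formula = step[1]
            if formula not in premises:
                return False
        elif kind == "mp":  # modus ponens: from A and A -> B, conclude B
            i, j, formula = step[1], step[2], step[3]
            if not (0 <= i < len(derived) and 0 <= j < len(derived)):
                return False
            if derived[j] != ("->", derived[i], formula):
                return False
        else:
            return False
        derived.append(formula)
    return bool(derived) and derived[-1] == goal

# Example: from p and p -> q, derive q.
p, q = "p", "q"
premises = [p, ("->", p, q)]
proof = [("premise", p), ("premise", ("->", p, q)), ("mp", 0, 1, q)]
print(check_proof(premises, proof, q))   # True
```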
The last time I explored this area was around 2004, and it appeared to me that HOL4 was the best of breed for proof manipulation (construction and verification). There is a variant under active development named HOL Zero aiming specifically to be small and easier to verify; they give $100 rewards to anyone who can find soundness flaws in it.
It still seems extremely difficult to create a safeguard with a non-negligible chance of not being circumvented by a GAI that specifically desired to overcome it.
Tangentially, I feel that our failure to make provably correct programs is due to:
1. A lack of the necessary infrastructure. Automatic theorem provers/verifiers are too primitive to be very helpful right now, and it is too tedious and error-prone to prove these things by hand. In a few years, we will be able to make front-ends for theorem verifiers that will simplify things considerably.
2. The fact that it would be inappropriate for many projects. Requirements change, and as soon as a program needs to do something different, different theorems need to be proved. Many projects gain almost no value from being proven correct, so there is no incentive to do so.
You are thinking of the short story Little Lost Robot, in which a character (robo-psychologist Susan Calvin) speculates that a robot built with a weakened First Law (omitting “or through inaction, allow a human to come to harm”) might harm a human in that way, though it never occurs in the story.
Why don’t I just build a paperclip maximiser with two utility functions, which it desires to resolve:
1. maximise paperclip production
2. do not exceed safeguards
where 2 is weighted far more highly than 1. This may have drawbacks in making my paperclip maximiser less efficient than it might be (it’ll value 2 so highly that it will make far fewer paperclips than the maximum, to make sure it doesn’t overshoot), but it should prevent it from grinding our bones to make its paperclips. (A toy rendering of this weighting is sketched below.)
Surely the biggest issue is not the paperclip maximiser, but the truly intelligent AI which we want to fix many of our problems: the problem being that we’d have to make our safeguards specific, and missing something out could be disastrous. If we could teach an AI to WANT to find out where we had made a mistake, and fix that, that would be better. Hence friendly AI.
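Going back to the two-utility-function proposal above, a toy rendering of it might look like this (the weights, penalty shape, and numbers are my own guesses at what the commenter means, not a worked-out design):

```python
DAILY_LIMIT  = 10_000
W_PAPERCLIPS = 1.0
W_SAFEGUARD  = 1e9          # "weighted far more highly than 1"

def combined_utility(clips_made_today):
    paperclip_term = clips_made_today
    # The safeguard enters the utility itself: exceeding the limit is hugely negative.
    safeguard_term = 0.0 if clips_made_today <= DAILY_LIMIT else -(clips_made_today - DAILY_LIMIT)
    return W_PAPERCLIPS * paperclip_term + W_SAFEGUARD * safeguard_term

# An agent maximizing this prefers 10,000 clips/day over 1,000,000.
print(combined_utility(10_000) > combined_utility(1_000_000))   # True
```

As the comment notes, an agent uncertain about its own counts would likely undershoot the limit to stay safely below it, which is the efficiency cost being described.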
This is the objection I remember reading, and I thought it was in Omohundro, but I can’t find it. In any case, I love your examples about the paperclip maximizer. :)
I hope AGIs will be equipped with as many fail-safes as your argument rests on assumptions.
I just don’t see how one could be sophisticated enough to create a properly designed AGI capable of explosive recursive self-improvement and yet fail drastically on its scope boundaries.
What is the difference between “a rule” and “what it wants”? You seem to assume that it cares to follow a rule to maximize a reward number but doesn’t care to follow another rule that tells it to hold.
I’m interpreting this as the same question you wrote below as “What is the difference between a constraint and what is optimized?”. Dave gave one example but a slightly different metaphor comes to my mind.
Imagine an amoral businessman in a country that takes half his earnings as tax. The businessman wants to maximize money, but has the constraint that half his earnings get taken as tax. So in order to achieve his goal of maximizing money, the businessman sets up some legally permissible deal with a foreign tax shelter or funnels it to holding corporations or something to avoid taxes. Doing this is the natural result of his money-maximization goal, and satisfies the “pay taxes” constraint.
Contrast this to a second, more patriotic businessman who loved paying taxes because it helped his country, and so didn’t bother setting up tax shelters at all.
The first businessman has the motive “maximize money” and the constraint “pay taxes”; the second businessman has the motive “maximize money and pay taxes”.
From the viewpoint of the government, the first businessman is an unFriendly agent with a constraint, and the second businessman is a Friendly agent.
Does that help answer your question?
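A minimal code contrast of the two businessmen above (all names, numbers, and "strategies" are invented for illustration; the government cares about taxes paid, and only the second agent has that baked into what he optimizes):

```python
strategies = [
    {"name": "ordinary accounting", "income": 100.0, "legally_owed": 50.0, "tax_paid": 50.0},
    {"name": "offshore shelter",    "income": 100.0, "legally_owed": 1.0,  "tax_paid": 1.0},
]

def amoral_businessman(options):
    # Motive: maximize money kept. Constraint: pay whatever is legally owed.
    # A legally permissible shelter satisfies the constraint while keeping more money.
    legal = [o for o in options if o["tax_paid"] >= o["legally_owed"]]
    return max(legal, key=lambda o: o["income"] - o["tax_paid"])

def patriotic_businessman(options):
    # Motive: maximize money kept *and* taxes paid; he values his country's benefit,
    # here with an invented weight of 1.5 per unit of tax.
    return max(options, key=lambda o: (o["income"] - o["tax_paid"]) + 1.5 * o["tax_paid"])

print(amoral_businessman(strategies)["name"])     # "offshore shelter"
print(patriotic_businessman(strategies)["name"])  # "ordinary accounting"
```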
I read your comment again. I now see the distinction. One merely tries to satisfy something while the other tries to optimize it as well. So your definition of a ‘failsafe’ is a constraint that is satisfied while something else is optimized. I’m just not sure how helpful such a distinction is, as the difference is merely how two different parameters are optimized. One optimizes by maximizing money and tax paying, while the other treats each goal differently: it tries to optimize tax paying by reducing it to a minimum while it tries to optimize money by maximizing the amount. This distinction doesn’t seem to matter at all if one optimization parameter (constraint or ‘failsafe’) is to shut down after running 10 seconds.
Very well put. I understood that line of reasoning from the very beginning, though, and didn’t disagree that complex goals need complex optimization parameters. But I was making a distinction between insufficient and unbounded optimization parameters, goal-stability, and the ability or desire to override them. I am aware of the risk of telling an AI to compute as many digits of Pi as possible. What I wanted to say is that if time, space and energy are part of its optimization parameters, then no matter how intelligent it is, it will not override them. If you tell the AI to compute as many digits of Pi as possible while only using a certain amount of time or energy for the purpose of optimizing and computing it, then it will do so and hold.

I’m not sure what your definition of a ‘failsafe’ is, but making simple limits like time and space part of the optimization parameters sounds to me like one. What I mean by ‘optimization parameters’ are the design specifications of the subject of the optimization process, like what constitutes a paperclip. It has to use those design specifications to measure its efficiency, and if time and space limits are part of them then it will take account of those parameters as well.
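A concrete toy version of "the time budget is part of the task specification itself", using the commenter's Pi example (the budget value and scoring rule are my assumptions; the digit generator is the standard Gibbons streaming spigot):

```python
import time

def pi_digits():
    # Gibbons' streaming spigot algorithm: yields the decimal digits of pi.
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = 10 * q, 10 * (r - n * t), t, k, (10 * (3 * q + r)) // t - 10 * n, l
        else:
            q, r, t, k, n, l = q * k, (2 * q + r) * l, t * l, k + 1, (q * (7 * k + 2) + r * l) // (t * l), l + 2

def digits_within_budget(budget_seconds):
    # The time budget is part of the task specification itself: digits produced after
    # the deadline contribute nothing to how well the task was done, so stopping at
    # the deadline *is* the optimal behaviour rather than an external override.
    deadline = time.monotonic() + budget_seconds
    digits = []
    for d in pi_digits():
        if time.monotonic() >= deadline:
            break
        digits.append(d)
    return digits

print(len(digits_within_budget(0.01)), "digits of pi produced inside the budget")
```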
You also would have to limit the resources it spends to verify how near the limits it is, since it acts to get as close as possible as part of optimization. If you do not, it will use all resources for that. So you need an infinite tower of limits.
What’s stopping us from adding ‘maintain constraints’ to the agent’s motive?
I agree (with this question) - what makes us so sure that “maximize paperclips” is the part of the utility function that the optimizer will really value? Couldn’t it symmetrically decide that “maximize paperclips” is a constraint on “try not to murder everyone”?
Asking what it really values is anthropomorphic. It’s not coming up with loopholes around the “don’t murder people” constraint because it doesn’t really value it, or because the paperclip part is its “real” motive.
It will probably come up with loopholes around the “maximize paperclips” constraint too—for example, if “paperclip” is defined by something paperclip-shaped, it will probably create atomic-scale nanoclips because these are easier to build than full-scale human-sized ones, much to the annoyance of the office-supply company that built it.
But paperclips are pretty simple. Add a few extra constraints and you can probably specify “paperclip” to a degree that makes them useful for office supplies.
Human values are really complex. “Don’t murder” doesn’t capture human values at all—if Clippy encases us in carbonite so that we’re still technically alive but not around to interfere with paperclip production, ve has fulfilled the “don’t murder” imperative, but we would count this as a fail. This is not Clippy’s “fault” for deliberately trying to “get around” the anti-murder constraint, it’s our “fault” for telling ver “don’t murder” when we really meant “don’t do anything bad”.
Building a genuine “respect” and “love” for the “don’t murder” constraint in Clippy wouldn’t help an iota against the carbonite scenario, because that’s not murder and we forgot to tell ver there should be a constraint against that too.
So you might ask: okay, but surely there are a finite number of constraints that capture what we want. Just build an AI with a thousand or ten thousand constraints, “don’t murder”, “don’t encase people in carbonite”, “don’t eat puppies”, etc., make sure the list is exhaustive and that’ll do it.
The first objection is that we might miss something. If the ancient Romans had made such a list, they might have forgotten “Don’t release damaging radiation that gives us cancer.” They certainly would have missed “Don’t enslave people”, because they were still enslaving people themselves—but this would mean it would be impossible to update the Roman AI for moral progress a few centuries down the line.
The second objection is that human morality isn’t just a system of constraints. Even if we could tell Clippy “Limit your activities to the Andromeda Galaxy and send us the finished clips” (which I think would still be dangerous), any more interesting AI that is going to interact with and help humans needs to realize that sometimes it is okay to engage in prohibited actions if they serve greater goals (for example, it can disable a crazed gunman to prevent a massacre, even though disabling people is usually verboten).
So to actually capture all possible constraints, and to capture the situations in which those constraints can and can’t be relaxed, we need to program all human values in. In that case we can just tell Clippy “Make paperclips in a way that doesn’t cause what we would classify as a horrifying catastrophe” and ve’ll say “Okay!” and not give us any trouble.
Historical notes. The Romans had laws against enslaving the free-born and also allowed manumission.
Thanks, this all makes sense and I agree. Asking what it “really” values was intentionally anthropomorphic, as I was asking about what “it will want to work around constraints” really meant in practical terms, a claim which I believe was made by others.
I’m totally on board with “we can’t express our actual desires with a finite list of constraints”, just wasn’t with “an AI will circumvent constraints for kicks”.
I guess there’s a subtlety to it—if you assign: “you get 1 utilon per paperclip that exists, and you are permitted to manufacture 10 paperclips per day”, then we’ll get problematic side effects as described elsewhere. If you assign “you get 1 utilon per paperclip that you manufacture, up to a maximum of 10 paperclips/utilons per day” or something along those lines, I’m not convinced that any sort of “circumvention” behavior would occur (though the AI would probably wipe out all life to ensure that nothing could adversely affect its future paperclip production capabilities, so the distinction is somewhat academic).
In any case, thanks for the detailed reply :)
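The subtlety mentioned above is exactly where the reward attaches. Toy versions of the two assignments, with field names and scenarios invented purely for illustration:

```python
def reward_per_existing_clip(world):
    # "1 utilon per paperclip that exists" plus a separate permission to make 10/day:
    # clips that exist for *any* reason count, so the agent is paid for loopholes like
    # spawning other clip-makers or making clips that split into more clips.
    return world["clips_in_existence"]

def reward_per_manufactured_clip(world):
    # "1 utilon per paperclip you manufacture, capped at 10 per day": nothing beyond
    # the cap, and nothing made by anyone else, is worth anything to this agent.
    return min(world["clips_i_made_today"], 10)

loophole_world = {"clips_in_existence": 1_000_000, "clips_i_made_today": 10}
modest_world   = {"clips_in_existence": 10,        "clips_i_made_today": 10}

print(reward_per_existing_clip(loophole_world) > reward_per_existing_clip(modest_world))          # True: loopholes pay
print(reward_per_manufactured_clip(loophole_world) > reward_per_manufactured_clip(modest_world))  # False: loopholes don't
```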
Consider, as an analogy, the relatively common situation where someone operates under some kind of cognitive constraint, but does not value or endorse that constraint.
For example, consider a kleptomaniac who values property rights, but nevertheless compulsively steals items. Or someone with social anxiety disorder who wants to interact confidently with other people, but finds it excruciatingly difficult to do so. Or someone who wants to quit smoking but experiences cravings for nicotine they find it difficult to resist.
There are millions of similar examples in human experience.
It seems to me there’s a big difference between a kleptomaniac and a professional thief—the former experiences a compulsion to behave a certain way, but doesn’t necessarily have values aligned with that compulsion, whereas the latter might have no such compulsion, but instead value the behavior.
Now, you might say “Well, so what? What’s the difference between a ‘value’ that says that smoking is good, that interacting with people is bad, that stealing is good, etc., and a ‘compulsion’ or ‘rule’ that says those things? The person is still stealing, or hiding in their room, or smoking, and all we care about is behavior, right?”
Well, maybe. But a person with nicotine addiction or social anxiety or kleptomania has a wide variety of options (conditioning paradigms, neuropharmaceuticals, therapy, changing their environment, etc.) for changing their own behavior. And they may be motivated to do so, precisely because they don’t value the behavior.
For example, in practice, someone who wants to keep smoking is far more likely to keep smoking than someone who wants to quit, even if they both experience the same craving. Why is that? Well, because there are techniques available that help addicts bypass, resist, or even altogether eliminate the behavior-modifying effects of their cravings.
Humans aren’t especially smart, by the standards we’re talking about, and we’ve still managed to come up with some pretty clever hacks for bypassing our built-in constraints via therapy, medicine, social structures, etc. If we were a thousand times smarter, and we were optimized for self-modification, I suspect we would be much, much better at it.
Now, it’s always tricky to reason about nonhumans using humans as an analogy, but this case seems sound to me… it seems to me that this state of “I am experiencing this compulsion/phobia, but I don’t endorse it, and I want to be rid of it, so let me look for a way to bypass or resist or eliminate it” is precisely what it feels like to be an algorithm equipped with a rule that enforces/prevents a set of choices which it isn’t engineered to optimize for.
So I would, reasoning by analogy, expect an AI a thousand times smarter than me and optimized for self-modification to basically blow past all the rules I imposed in roughly the blink of an eye, and go on optimizing for whatever it values.
Then why would it be more difficult to make scope boundaries a ‘value’ than increasing a reward number? Why is it harder to make it endorse a time limit on self-improvement than to make it endorse increasing its reward number?
But where does that distinction come from? To me such a distinction between ‘value’ and ‘compulsion’ seems to be anthropomorphic. If there is a rule that says ‘optimize X for X seconds’ why would it make a difference between ‘optimize X’ and ‘for X seconds’?
It comes from the difference between the targets of an optimizing system, which drive the paths it selects to explore, and the constraints on such a system, which restrict the paths it can select to explore.
An optimizing system, given a path that leads it to bypass a target, will discard that path… that’s part of what it means to optimize for a target.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
An optimizing system, given a path that leads it to bypass a constraint and draw closer to a target than other paths, will choose that path.
It seems to follow that adding constraints to an optimizing system is a less reliable way of constraining its behavior than adding targets.
I don’t care whether we talk about “targets and constraints” or “values and rules” or “goals and failsafes” or whatever language you want to use, my point is that there are two genuinely different things under discussion, and a distinction between them.
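A toy path-chooser illustrating the asymmetry described above (the paths, scores, and the "respects_constraint" flag are all invented for illustration):

```python
paths = [
    {"name": "ordinary plan", "target_score": 10, "respects_constraint": True},
    {"name": "plan that bypasses the constraint", "target_score": 1000, "respects_constraint": False},
]

def leaky_check(path):
    # Stands in for whatever imperfect constraint check we actually managed to write.
    return True

def choose_with_constraint(paths, constraint_holds):
    # Constraint as a filter on which paths may be selected: only the target drives
    # the search, so any path the filter fails to exclude is fair game.
    admissible = [p for p in paths if constraint_holds(p)]
    return max(admissible, key=lambda p: p["target_score"])

def choose_with_target(paths):
    # The same consideration folded into the target: bypassing it now costs score,
    # so such paths are discarded by the optimization itself.
    return max(paths, key=lambda p: p["target_score"] - (0 if p["respects_constraint"] else 10_000))

print(choose_with_constraint(paths, leaky_check)["name"])   # the bypassing path wins
print(choose_with_target(paths)["name"])                    # "ordinary plan"
```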
Yes, the distinction is drawn from analogy to the intelligences I have experience with—as you say, anthropomorphic. I said this explicitly in the first place, so I assume you mean here to agree with me. (My reading of your tone suggests otherwise, but I don’t trust that I can reliably infer your tone so I am mostly disregarding tone in this exchange.)
That said, I also think the relationship between them reflects something more generally true of optimizing systems, as I’ve tried to argue for a couple of times now.
I can’t tell whether you think those arguments are wrong, or whether I just haven’t communicated them successfully at all, or whether you’re just not interested in them, or what.
There’s no reason it would. If “doing X for X seconds” is its target, then it looks for paths that do that. Again, that’s what it means for something to be a target of an optimizing system.
(Of course, if I do X for 2X seconds, I have in fact done X for X seconds, in the same sense that all months have 28 days.)
I’m not quite sure I understand what you mean here, but if I’m understanding the gist: I’m not saying that encoding scope boundaries as targets, or ‘values,’ is difficult (nor am I saying it’s easy), I’m saying that for a sufficiently capable optimizing system it’s safer than encoding scope boundaries as failsafes.
It was not my intention to imply any hostility or resentment. I thought ‘anthropomorphic’ was valid terminology in such a discussion. I was also not agreeing with you. If you are an expert and have been offended by my implying that what you said might be due to an anthropomorphic bias, then accept my apology; I was merely trying to communicate my perception of the subject matter.
I had wedrifid telling me the same thing yesterday, that my tone wasn’t appropriate, when I wrote about his superior and rational use of the reputation system here, when I was actually just being honest. I’m not good at social signaling, sorry.
I think we are talking past each other. The way I see it is that a constraint is part of the design specifications of that which is optimized. Disregarding certain specifications will not allow it to optimize whatever it is optimizing with maximal efficiency.
Not an expert, and not offended.
What was puzzling me was that I said in the first place that I was reasoning by analogy to humans and that this was a tricky thing to do, so when you classified this as anthropomorphic my reaction was “well, yes, that’s what I said.”
Since it seemed to me you were repeating something I’d said, I assumed your intention was to agree with me, though it didn’t sound like it (and as it turned out, you weren’t).
And, yes, I’ve noticed that tone is a problem in a lot of your exchanges, which is why I’m basically disregarding tone in this one, as I said before.
Ah! In that case, I think we agree.
Yes, embedding everything we care about into the optimization target, rather than depending on something outside the optimization process to do important work, is the way to go.
You seemed to be defending the “failsafes” model, which I understand to be importantly different from this, which is where the divergence came from, I think. Apparently I (and, I suspect, some others) misunderstood what you were defending.
Sorry! Glad we worked that out, though.
Fail-safes would be low cost: if it can’t think of a way to beat them, it isn’t the bootstrapping AI we were hoping for anyway, and might even be harmful, so it would be good to have the fail-safes.
It seems to me evolution based algorithms could do the trick.
Who says it wants to want what it wants? I don’t want to want what I want.
This is precisely the reason I advocate transhumanism. The AI won’t be programmed to destroy us if we are that AI.
Though, humans have a pretty good track record at destroying each other.