Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied.
You have paraphrased the two versions as:
Doing dumb things because you think they are correct
and
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be ‘thinking’ when it goes through this process.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant.
For example, you talked about “Doing dumb things because you think they are correct” …. but what does it mean to say that you ‘think’ that they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct (“Jill took the ice-cream from Jack because she didn’t know that it was wrong to take someone else’s ice-cream.”). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs …. something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map on to the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed, and still operate as a successful intelligence.
Or, more immediately, it is not at all clear that we can say about that AI “It did a dumb thing because it ‘thought’ it was correct.”
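To make that peculiarity concrete, here is a minimal sketch of the sort of control flow being imputed to the AI (purely illustrative; every name in it is hypothetical):

```python
# Purely illustrative sketch of the architecture being imputed: the system can
# see evidence that its own plan is bad, yet nothing in the control flow lets
# that evidence affect what it does.  All names are hypothetical.

def planning_engine(goal: str) -> str:
    """Toy stand-in for a planner that produces a literal-minded plan."""
    return f"literal-minded plan for: {goal}"

def evidence_plan_is_bad(plan: str) -> bool:
    """Toy stand-in for the background knowledge flagging the plan as crappy."""
    return "literal-minded" in plan

def act(goal: str) -> None:
    plan = planning_engine(goal)
    if evidence_plan_is_bad(plan):
        # The AI is 'fully cognizant' of the evidence ...
        print("Evidence available: this plan looks bad.")
    # ... but is 'compelled' to execute the plan anyway; the check changes nothing.
    print("Executing:", plan)

act("maximize friendliness")
```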
I should add that, in both of my descriptions of the DLI that you quoted, I see no substantial difference (beyond those imponderables I just mentioned), and that in both cases I was actually trying to say something very close to the second paraphrase that you gave, namely:
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all! Other people are suggesting some such AI, and I am arguing that the design is so logically incoherent that the AI (if it could be made to exist) would call attention to that problem and suggest means to correct it.
Anyhow, the takeaway from this comment is: the people who talk about an AI that exhibits this kind of behavior are actually suggesting a behavior that they have not really thought through carefully, so as a result we can find ourselves walking into a minefield if we go and try to clean up the mess that they left.
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all!
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
I actually meant “viable” in the sense of the third of my listed cases of incoherence at: http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/cdap
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper.
Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
(Which is NOT to say, as some have inferred, that I believe an AI is “dumb” just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of “dumb”).
MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question. Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence to reality, because that would make them less effective at achieving terminal goals, and the raison d’être of such transient sub-goals is to support the achievement of terminal goals.
By MIRI’s reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn’t have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal towards truth. It should not be surprising that an entity that doesn’t value truth, in a certain sense, doesn’t behave rationally, in a certain sense. (Actually, there are a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
That’s an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren’t pushed around by logic.
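To make that model concrete, here is a minimal sketch of the decision loop just described, under those premises (illustrative only; all names are hypothetical, and this is nobody’s actual architecture):

```python
# Minimal sketch of the decision model described above.  Instrumental sub-goals
# are revised when they stop serving the terminal goal, but nothing in the loop
# ever touches the terminal goal itself: there is no higher-priority goal
# (towards truth, say) that could motivate overriding it.

from dataclasses import dataclass, field

@dataclass
class GoalSystem:
    terminal_goal: str                      # fixed: never re-examined by this loop
    instrumental_goals: list = field(default_factory=list)

    def effectiveness(self, subgoal: str) -> float:
        """Toy estimate of how well a sub-goal currently serves the terminal goal."""
        return 0.1 if "outdated" in subgoal else 0.9

    def update(self) -> None:
        # Sub-goals exist only to support the terminal goal, so any that have
        # fallen out of correspondence with reality (here: low effectiveness)
        # get dropped and could be replaced.
        self.instrumental_goals = [
            g for g in self.instrumental_goals if self.effectiveness(g) > 0.5
        ]
        # Note what is absent: no branch ever asks whether the terminal goal
        # itself is false, miscoded, or worth overriding.

ai = GoalSystem("maximize friendliness", ["acquire resources", "outdated subplan"])
ai.update()
print(ai.terminal_goal, ai.instrumental_goals)
```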
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper. Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about the AI architectures. You are saying these assumptions aren’t merely unjustified, they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn’t necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraint is basically the justification for why instrumental goals should be revisable under evidence of misbehavior …. it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (aka context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals.
It’s just a mess. Their theoretical ideas are just shoot-from-the-hip, plus some math added on top to make it look like some legit science.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
I’m having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don’t want to accidentally buy into, because I don’t think goals expressed in language statements work anyway. So I am treading on eggshells here.
A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
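As a rough illustration of what that would look like (every name here is hypothetical, nothing more than a sketch), the constraints sit outside the goal hierarchy and can veto a candidate action no matter which goal generated it:

```python
# Illustrative sketch only: no goal gets exceptional status; contextual
# constraints act as the guard against drift by vetoing candidate actions
# from any goal, terminal or instrumental alike.

from typing import Callable, List

# Each constraint returns True when a candidate action violates it.
Constraint = Callable[[str], bool]

constraints: List[Constraint] = [
    lambda action: "extreme" in action,        # no going to extremes
    lambda action: "coerce" in action,         # toy stand-ins for a very large
    lambda action: "irreversible" in action,   # set of contextual, commonsense checks
]

def permitted(action: str) -> bool:
    return not any(violates(action) for violates in constraints)

def choose_action(candidates: List[str]) -> str:
    # Drift shows up as a context mismatch: any candidate that trips the
    # constraints is discarded, no matter which goal generated it.
    allowed = [a for a in candidates if permitted(a)]
    return allowed[0] if allowed else "do nothing and flag the mismatch for review"

print(choose_action(["take extreme action to maximize friendliness",
                     "make a modest, reversible improvement"]))
```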
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous.
That gets close to “do it right”.
And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
Which is an open doorway to an AI that kills everyone because of miscoded friendliness.
If you want safety features, and you should, you would need them to override the ostensible purpose of the machine....they would be pointless otherwise....even the humble off switch works that way.
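A minimal sketch of that point (hypothetical names only): the off switch is checked before the goal-pursuit step, not negotiated with it.

```python
# The safety override is checked before the goal-pursuit step ever runs;
# it is not weighed against the machine's ostensible purpose.

off_switch_pressed = False

def pursue_purpose() -> str:
    return "working toward the ostensible purpose"

def step() -> str:
    if off_switch_pressed:      # the override comes first, unconditionally
        return "halted"
    return pursue_purpose()

print(step())                   # -> working toward the ostensible purpose
off_switch_pressed = True
print(step())                   # -> halted
```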
A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
Arguably, those constraints would be a kind of negative goal.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant
They are clear that they don’t mean the AI’s rigid behaviour is the result of it assessing its own inferential processes as infallible … that is what the controversy is all about.
The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan.
That is just what “The Genie Knows but Doesn’t Care” is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.