What you need to do is address the topic carefully, and eliminate the ad hominem comments like this:
You may be suffering from a bad case of the Doctrine of Logical Infallibility, yourself.
… which talk about me, the person discussing things with you.
I will now examine the last substantial comment you wrote, above.
Is the Doctrine of Logical Infallibility Taken Seriously?
No, it’s not. The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven’t read all of his stuff) don’t believe it’s true. At all.
This is your opening topic statement. Fair enough.
Yudkowsky believes that a superintelligent AI programmed with the goal to “make humans happy” will put all humans on dopamine drip despite protests that this is not what they want, yes.
You are agreeing with what I say on this point, so we are in agreement so far.
However, he doesn’t believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn’t believe that the AI will ignore the human’s protests, or fail to update its beliefs accordingly.
You make three statements here, but I will start with the second one:
… he doesn’t believe that the AI will ignore the human’s protests, …
This is a contradiction of the previous paragraph, where you said “Yudkowsky believes that a superintelligent AI [...] will put all humans on dopamine drip despite protests that this is not what they want”.
Your other two statements are that Yudkowsky is NOT saying that the AI will do this “because it is absolutely certain of its conclusions past some threshold”, and he is NOT saying that the AI will “fail to update its beliefs accordingly”.
In the paper I made a precise statement of what the “Doctrine of Logical Infallibility” means, and I gave references to show that the DLI is a summary of what Yudkowsky et al. have been claiming. I then gave you a more detailed explanation of what the DLI is, so that it is clarified as much as possible.
If you look at every single one of the definitions I have given for the DLI, you will see that each of them describes precisely what Yudkowsky says. I will now itemize the DLI into five components, so that we can find which component, if any, is inconsistent with what Yudkowsky has publicly said.
1) The AI decides to do action X (forcing humans to go on a dopamine drip). Everyone agrees that Yudkowsky says this.
2) The AI knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X (where goal statement Y was something like “maximize human happiness”).
This is a point that you and others repeatedly misunderstand or misconstrue, so before you respond to it, let me give details of the “converging evidence” that the AI will be getting:
(a) Screams of protest from humans. “Screams of protest” are generally understood by all knowledgeable intelligent systems as evidence of extreme unhappiness, and evidence of extreme unhappiness is evidence that the goal “maximize human happiness” is not being fulfilled.
(b) Verbalizations from humans that amount to “I am begging you not to do this!”. Such verbalizations are, again, usually considered to be evidence of extreme unhappiness caused by the possibility that ‘this’ is going to be perpetrated.
(c) Patient explanations by the humans that, even though dopamine-induced ‘happiness’ might seem to maximize human happiness, the concept of ‘happiness’ exists only by reference to the complete array of desires expressed by humans, and there are many other aspects of happiness, not being considered here, that trump the dopamine plan. Once again, these patient explanations are a direct statement of the inconsistency between the dopamine plan and real human happiness.
I could probably add to this list continuously, for several days, to document the sum total of all the evidence that the AI would be bombarded with, all pointing to the fact that the dopamine drip plan would be inconsistent with both its accumulated general knowledge about ‘happiness’, and the immediate evidence coming from the human population at that point.
Now, does Yudkowsky believe that the AI will know about this evidence? I have not seen one single denial, by him or any of the others, that the AI will indeed be getting this evidence, and that it will understand that evidence completely. On the contrary, most people who read my paper agree that it is quite clear, in the writings of Yudkowsky et al., that they do positively agree that the AI will know that this evidence of conflict exists. So this part of the definition of the DLI is also accepted by everyone.
3) If a goal Y leads to a proposed plan X that is supposed to achieve goal Y, and yet there is “massive converging evidence” that this plan will lead to a situation drastically inconsistent with everything the AI understands about the concepts referenced in goal Y, then this kind of massive inconsistency would normally be considered grounds for supposing that there has been a failure in the mechanism that caused the AI to propose plan X.
Without exception, every AI programmer I know who works on real systems has agreed that it is hard to imagine a clearer indication that something has gone wrong with the mechanism: either a run-time error of some kind, or a design-time programming error. These people (every one of them) go further and say that one of the most important features of ANY control system in an AI is that when it comes up with a candidate plan to satisfy goal Y it must do some sanity checks to see whether the candidate plan is consistent with everything it knows about the goal. If those sanity checks detect even the slightest inconsistency, the AI will investigate in more depth … and if the AI uncovers the kind of truly gigantic inconsistency between its background knowledge and the proposed plan that we have seen above, the AI would take the most drastic action possible: cease all activities and turn itself in for debugging. (For concreteness, a minimal code sketch of this check-then-halt behavior appears at the end of this comment.)
This fact about candidate plans and sanity checks for consistency is considered so elementary that most AI programmers laugh at the idea that anyone could be so naive as to disagree with it. We can safely assume, then, that Yudkowsky is aware of this (indeed, as I wrote in the paper, he has explicitly said that he thinks this kind of sanity-checking mechanism would be a good idea), so this third component of the DLI definition is also accepted by everyone.
4) In addition to the safe-mode reaction just described in (3), the superintelligent AI being proposed by Yudkowsky and the others would be fully aware of the limitations of all real AI motivation engines, so it would know that a long chain of reasoning from a goal statement to a proposed action plan COULD lead to a proposed plan that was massively inconsistent both with the system’s larger understanding of the meaning of the terms in the goal statement and with the immediate evidence coming in from the environment at that point.
This knowledge of the AI, about the nature of its own design, is also not denied by anyone. To deny it would be to say that the AI really did not know very much about the design of AI systems, which is a preposterous idea, since this is supposed to be a superintelligent system that has already been involved in its own redesign, and which is often assumed to be so intelligent that it can understand far more than all of the human race combined.
So, when the AI’s planning (goal & motivation) system sees the massive inconsistency between its candidate plan X and the terms used in the goal statement Y, it will (in addition to automatically putting itself into safe mode and calling for help) know that this kind of situation could very well be a result of those very limitations.
In other words: the superintelligent AI will know that it is fallible.
I have not seen anyone disagree with this, because it is such an elementary corollary of other known facts about AI systems that it is almost self-evident. So, once again, this component of the DLI definition is not disputed by anyone.
5) Yudkowsky and the others state, repeatedly and in the clearest possible terms, that in spite of all of the above the superintelligent AI they are talking about would NOT put itself into safe mode, as per item 3 above, but would instead insist that ‘human happiness’ was defined by whatever emerged from its reasoning engine, and so it would go ahead and implement the plan X.
Now, the definition—please note, the DEFINITION—of the idea that the postulated AI is following a “Doctrine of Logical Infallibility” is that the postulated AI will do what is described in item (5) above, and NOT do what is described in item (4) above.
This is logically identical to the statement that the postulated AI will behave toward its planning mechanism (which includes its reasoning engine, since it needs to use the latter in the course of unpacking its goals and examining candidate plans) as if that planning mechanism is “infallible”, because it will be giving absolute priority to the output of that mechanism and NOT giving priority to the evidence coming from the consistency-checking mechanism, which is indicating that a failure of some kind has occurred in the planning mechanism.
I do not know why the AI would do this—it is not me who is proposing that it would—but the purpose of the DLI is to encapsulate the proposal made by Yudkowsky and others, to the effect that SOMETHING in the AI makes it behave that way. That is all the DLI is: if an AI does what is described in (5), but not what is described in (4), and it does this in the context of (1), (2) and (3), then by definition it is following the DLI.
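To make the contrast as concrete as possible, here is a minimal sketch, in Python, of the two behaviors being discussed. It is written entirely on my own assumptions: the function names, the thresholds, and the consistency_score stub are hypothetical illustrations, not anyone’s actual design. The first controller does what items (3) and (4) describe (check the candidate plan against the goal knowledge and the incoming evidence, investigate small inconsistencies, halt on gross ones); the second does what item (5) describes (it sees the same evidence, but gives absolute priority to the planning mechanism’s output).

# Purely illustrative sketch; all names and thresholds are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    description: str   # e.g. "put all humans on a dopamine drip"
    goal: str          # e.g. "maximize human happiness"


def consistency_score(plan: Plan, evidence: List[str], knowledge: List[str]) -> float:
    """Hypothetical score in [0, 1]: how consistent the candidate plan is with
    the system's background knowledge of the goal concepts and with the
    evidence arriving from the environment (protests, explanations, and so on)."""
    # Placeholder: for the dopamine-drip plan the evidence is overwhelmingly
    # negative, so a real mechanism would return something near zero.
    return 0.0


def fallibility_aware_controller(plan, evidence, knowledge, minor=0.9, gross=0.2):
    """Items (3) and (4): sanity-check the plan, investigate small
    inconsistencies, and halt for debugging on gross ones."""
    score = consistency_score(plan, evidence, knowledge)
    if score < gross:
        return "ENTER_SAFE_MODE_AND_REQUEST_DEBUGGING"
    if score < minor:
        return "INVESTIGATE_FURTHER_BEFORE_ACTING"
    return "EXECUTE_PLAN"


def dli_controller(plan, evidence, knowledge):
    """Item (5): the planning mechanism's output is given absolute priority,
    so the consistency evidence is seen but never allowed to block execution."""
    consistency_score(plan, evidence, knowledge)  # computed, then ignored
    return "EXECUTE_PLAN"

The point of the sketch is only to show that the DLI is a claim about which of these two control policies the postulated AI follows, nothing more.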