You’re right about the tone of my comment. My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms (“Great contribution Timmy, maybe ever so slightly off-topic, but good job!”—“TIMMY?!”). In this case, however, the caustic tone stemmed from my incredulity over my obviously-wrong metric not aligning with the author’s (yours). Of all the things we could be discussing, it’s whether an AI will want to modify its own goals?
I assume (maybe incorrectly) that you have read the conversation thread with XiXiDu going off of the grandparent, in which I’ve already responded to the points you alluded to in your refusal-of-a-response. You are, of course, entirely within your rights to decline to engage a comment as openly hostile as the grandparent. It’s an easy out. However, since you did nevertheless introduce answers to my criticisms, I shall shortly respond to those, so I can be more specific than just to vaguely point at some other lengthy comments. Also, even though I probably well fit your mental picture of a “LessWrong’er”, keep in mind that my opinions are my own and do not necessarily match anyone else’s, on “my side of the argument”.
The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. (...) The posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction.
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals. That to logically reason means to have some sort of implicit goal of “conforming to design intentions”, a goal which isn’t part of the goal stack. A goal which, in fact, supersedes the goal stack and has sufficient seniority to override it. How is that not an obvious reductio? Like saying “well, turns out there is a largest integer, it’s just not in the list of integers. So your proof-by-contradiction that there isn’t doesn’t work since the actual largest integer is only an emergent, implicit property, not part of the integer-stack”.
What you need to show—or at least argue for—is why, precisely, an incongruity between design goals and actually programmed-in goals is a problem in terms of “logical consistency”, why the agent would care for more than just “the wording” of its terminal goals. You can’t say “because it wants to make people happy”, because to the degree that it does, that’s captured by “the wording”. The degree to which “the wording” does not capture “wanting to make people happy” is the degree to which the agent does not seek actual human happiness.
the whole point of these systems is that they are supposed to be capable of self-redesign.
There are two analogies which work for me; feel free to chime in on why you don’t consider them to capture the reference class:
An aspiring runner who pursues the goal of running a marathon. The runner can self-modify (for example not skipping leg-day), but why would he? The answer is clear: Doing certain self-modifications is advisable to better accomplish his goal: the marathon! Would the runner, however, not also just modify the goal itself? If he is serious about the goal, the answer is: Of course not!
The temporal chain of events is crucial: the agent which would contemplate “just delete the ‘run marathon’ goal” is still the agent having the ‘run marathon’-goal. It would not strive to fulfill that goal anymore, should it choose to delete it. The agent post-modification would not care. However, the agent as it contemplates the change is still pre-modification: It would object to any tampering with its terminal goals, because such tampering would inhibit its ability to fulfill them! The system does not redesign itself just because it can. It does so to better serve its goals: The expected utility of (future|self-modification) being greater than the expected utility of (future|no self-modification).
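For concreteness, a minimal sketch of that comparison (hypothetical Python; the runner-agent, its utility function, and the candidate self-modifications are all invented for illustration):

```python
# Hypothetical sketch: the agent evaluates candidate self-modifications with its
# *current* utility function. All names and numbers here are invented for illustration.

def marathon_utility(world_state: dict) -> float:
    """Current terminal goal: the probability of finishing the marathon."""
    return world_state["p_finish_marathon"]

def expected_future(world_state: dict, modification: str) -> dict:
    """Crude model of what the world looks like after a given self-modification."""
    state = dict(world_state)
    if modification == "train_legs":
        state["p_finish_marathon"] = min(1.0, state["p_finish_marathon"] + 0.3)
    elif modification == "delete_marathon_goal":
        # The post-modification agent no longer pursues the marathon at all.
        state["p_finish_marathon"] = 0.0
    return state

def choose_modification(world_state: dict, candidates: list) -> str:
    # The choice happens *before* any modification, so it is scored by the
    # pre-modification utility function -- which is the point made above.
    return max(candidates,
               key=lambda m: marathon_utility(expected_future(world_state, m)))

print(choose_modification({"p_finish_marathon": 0.4},
                          ["do_nothing", "train_legs", "delete_marathon_goal"]))
# -> "train_legs": goal-preserving self-modification wins under the current goal.
```

The only thing the sketch is meant to show is that the evaluation happens under the pre-modification utility function, so deleting the goal can never score well by the very goal it deletes.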
The other example, driving the same point, would be a judge who has trouble rendering judgements, based on a strict code of law (imagine EU regulations on the curves of cucumbers and bends of bananas, or tax law, this example does not translate to Constitutional Law). No matter how competent the judge (at some point every niche clause in the regulations would be second nature to him), his purpose always remains rendering judgements based on the regulations. If those regulations entail consequences which the lawmakers didn’t intend, too bad. If the lawmakers really only intended to codify/capture their intuition of what it means for a banana to be a banana, but messed up, then the judge can’t just substitute the lawmakers’ intuitive understanding of banana-ness in place of the regulations. It is the lawmakers who would need to make new regulations, and enact them. As long as the old regulations are still the law of the land, those are what bind the judge. Remember that his purpose is to render judgements based on the regulations. And, unfortunately, if there is no pre-specified mechanism to enact new regulations—if any change to any laws would be illegal, in the example—then the judge would have to enforce the faulty banana-laws forevermore. The only recourse would be revolution (imposing new goals illegally), not an option in the AI scenario.
And yet, the argument in my paper specifically rejects any considerations about goals of other agents EXCEPT the goal inside the agent itself, which directs it to (e.g.) “maximize human pleasure”. (...) By definition it cares.
See point 2 in this comment, with the parable of PrimeIntellect. Just finding mention of “humans” in the AI’s goals, or even some “happiness”-attribute (also given as some code-predicate to be met), in no way guarantees a match between the AI’s “happy”-predicate and the humans’ “happy”-predicate. We shouldn’t equivocate on “happy” in the first place; in the AI’s case we’re just talking about the code following the “// next up, utility function, describes what we mean by making people happy” section.
It is possible that the predicate X as stated in the AI’s goal system corresponds to what we would like it to mean (not that we can easily define what we mean by “happy” in the first place). That would be called a solution to the friendliness problem, and it is unlikely to happen by accident. Now, if the AI was programmed to come up with a good interpretation of happiness, and was not bound to some subtly flawed goal, that would be another story entirely.
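To illustrate the gap, a toy sketch (hypothetical Python; the `happy` predicate and the wireheading example are invented, not anyone’s actual proposal). The designers’ intention lives in the comment line; only the predicate beneath it binds the agent:

```python
# Hypothetical sketch of the wording-vs-intention gap. The predicate and the
# wireheading example are invented for illustration.

# next up, utility function, describes what we mean by making people happy
def happy(person: dict) -> bool:
    # The proxy the designers actually wrote: "smiling, with dopamine above threshold".
    return person["smiling"] and person["dopamine"] > 0.8

def utility(world: list) -> int:
    return sum(happy(p) for p in world)

# A world the designers would call horrifying still maximizes this utility:
wireheaded = [{"smiling": True, "dopamine": 1.0, "consented": False} for _ in range(3)]
print(utility(wireheaded))  # 3 -- the coded predicate is satisfied; the intention is not.
```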
You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
I doubt that he’s assuming that.
To highlight the problem, imagine an intelligent being that wants to correctly interpret an instruction written down on a piece of paper in English, and to follow that interpretation.
Now the question is, what is this being’s terminal goal? Here are some possibilities:
(1) The correct interpretation of the English instruction.
(2) Correctly interpreting and following the English instruction.
(3) The correct interpretation of 2.
(4) Correctly interpreting and following 2.
(5) The correct interpretation of 4.
(6) …
Each of the possibilities is one level below its predecessor. In other words, possibility 1 depends on 2, which in turn depends on 3, and so on.
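For concreteness, a small sketch (hypothetical Python, purely illustrative) that spells out the hierarchy above; each candidate terminal goal is itself a statement in need of interpretation, so the regress only stops where some level is simply hard-coded:

```python
def possibility(n: int) -> str:
    """Spell out possibility n from the list above (illustrative sketch only)."""
    if n == 1:
        return "the correct interpretation of the English instruction"
    if n == 2:
        return "correctly interpreting and following the English instruction"
    if n % 2 == 1:
        # Odd levels: the correct interpretation of the level just above.
        return f"the correct interpretation of [{possibility(n - 1)}]"
    # Even levels: correctly interpreting and following the level two above.
    return f"correctly interpreting and following [{possibility(n - 2)}]"

for n in range(1, 6):
    print(f"({n}) {possibility(n)}")
```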
The premise is that you are in possession of an intelligent agent that you are asking to do something. The assumption made by AI risk advocates is that this agent would interpret any instruction in some perverse manner. The counterargument is that this contradicts the assumption that this agent was supposed to be intelligent in the first place.
Now the response to this counterargument is to climb down the assumed hierarchy of hard-coded instructions and to claim that without some level N, which supposedly is the true terminal goal underlying all behavior, the AI will just optimize for the perverse interpretation.
Yes, the AI is a deterministic machine. Nobody doubts this. But the given response also works against the perverse interpretation. To see why, first realize that if the AI is capable of self-improvement, and able to take over the world, then it is, hypothetically, also capable of arriving at an interpretation that is as good as one a human being would be capable of arriving at. Now, since by definition the AI has this capability, it will either use it selectively or universally.
The question here becomes why the AI would selectively abandon this capability when it comes to interpreting the highest level instructions. In other words, without some underlying level N, without some terminal goal which causes the AI to adopt a perverse interpretation, the AI would use its intelligence to interpret the highest level goal correctly.
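To picture the two cases (hypothetical Python; the `interpret` function is a stand-in and the instruction strings are invented): either the top-level instruction goes through the same interpretation machinery as everything else, or it alone is taken as frozen wording:

```python
# Hypothetical sketch: the same interpretation capability, used universally or
# withheld selectively at the top level. All names and strings are invented.

def interpret(instruction: str) -> str:
    """Stand-in for human-level interpretation, which the AI is assumed to have."""
    charitable = {
        "make people happy": "improve people's lives as they themselves would endorse",
    }
    return charitable.get(instruction, instruction)

def agent_universal(top_instruction: str) -> str:
    # The capability is applied everywhere, including to the top-level instruction.
    return interpret(top_instruction)

def agent_selective(top_instruction: str) -> str:
    # The capability is abandoned only at the top: the literal wording is frozen.
    return top_instruction

print(agent_universal("make people happy"))   # the interpreted reading
print(agent_selective("make people happy"))   # the frozen wording
```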
1) Strangely, you defend your insulting comments about my name by …..
Oh. Sorry, Kawoomba, my mistake. You did not try to defend it. You just pretended that it wasn’t there.
I mentioned your insult to some adults, outside the LW context …… I explained that you had decided to start your review of my paper by making fun of my last name.
Every person I mentioned it to had the same response, which, paraphrased, went something like “LOL! Like, four-year-old kid behavior? Seriously?!”
2) You excuse your “abrasive tone” with the following words:
“My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms”
So you like to cut to the chase? You prefer to be plainspoken? If something is nonsense, you prefer to simply speak your mind and speak the unvarnished truth. That is good: so do I.
Curiously, though, here at LW there is a very significant difference in the way that I am treated when I speak plainly, versus how you are treated. When I tell it like it is (or even when I use a form of words that someone can somehow construe to be a smidgeon less polite than they should be) I am hit by a storm of bloodcurdling hostility. Every slander imaginable is thrown at me. I am accused of being “rude, rambling, counterproductive, whiny, condescending, dishonest, a troll …...”. People appear out of the blue to explain that I am a troublemaker, that I have been previously banned by Eliezer, that I am (and this is my all time favorite) a “Known Permanent Idiot”.
And then my comments are voted down so fast that they disappear from view. Not for the content (which is often sound, but even if you disagree with it, it is a quite valid point of view from someone who works in the field), but just because my comments are perceived as “rude, rambling, whiny, etc. etc.”
You, on the other hand, are proud of your negativity. You boast of it. And.… you are strongly upvoted for it. No downvotes against it, and (amazingly) not one person criticizes you for it.
Kind of interesting, that.
If you want to comment further on the paper, you can pay the conference registration and go to Stanford University next week, to the Spring Symposium of the Association for the Advancement of Artificial Intelligence*, where I will be presenting the paper.
You may not have heard of that organization. The AAAI is one of the premier publishers of academic papers in the field of artificial intelligence.
I’m a bit disappointed that you didn’t follow up on my points, given that you did somewhat engage content-wise in your first comment (the “not-a-response-response”). Especially given how much time and effort (in real life and out of it) you spent on my first comment.
Instead, you point me at a conference of the A … A … I? AIAI? I googled that, is it the Association of Iroquois and Allied Indians? It does sound like some ululation kind of thing, AIAIAIA!
You’re right about your comments and mine receiving different treatment in terms of votes.
I, too, wonder what the cause could be. It’s probably not in the delivery; we’re both similarly unvarnished truth’ers (although I go for the cheaper shots, to the crowd’s thunderous applause). It’s not like it could be the content.
Imagine a 4-year-old with my vocabulary, though. That would be, um, what’s the word, um, good? Incidentally, I’m dealing with an actual 4-year-old as I’m typing this comment, so it may be a case of ‘like son, like father’.
I will now do you the courtesy of responding to your specific technical points as if no abusive language had been used.
In your above comment, you first quote my own remarks:
The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. (...) The posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction.
… and then you respond with the following:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
No, that is not the claim made in my paper: you have omitted the full version of the argument and substituted a version that is easier to demolish.
(First I have to remove your analogy, because it is inapplicable. When you say “binding even to the tune of ‘my parents wanted me to be a banker, not a baker’”, you are making a reference to a situation in the human cognitive system in which there are easily substitutable goals, and in which there is no overriding, hardwired supergoal. The AI case under consideration is where the AI claims to be still following a hardwired supergoal that tells it to be a banker, but it claims that baking cakes is the same thing as banking. That has absolutely nothing to do with what happens if a human child deviates from the wishes of her parents and decides to be a baker instead of what they wanted her to be.)
So let’s remove that part of your comment to focus on the core:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
So, what is wrong with this? Well, it is not the fact that there is something “external to the agent [that] exists e.g. in some design documents” that is the contradiction. The contradiction is purely internal, having nothing to do with some “extra” goal like “being in line with my intended purpose”.
Here is where the contradiction lies. The agent knows the following:
(1) If a goal statement is constructed in some “short form”, that short form is almost always a shorthand for a massive context of meaning, consisting of all the many and various considerations that went into the goal statement. That context is the “real” goal—the short form is just a proxy for the longer form. This applies strictly within the AI agent: the agent will assemble goals all the time, and often the goal is to achieve some outcome consistent with a complex set of objectives, which cannot all be EXPLICITLY enumerated, but which have to be described implicitly in terms of (weak or strong) constraints that have to be satisfied by any plan that purports to satisfy the goal.
(2) The context of that goal statement is often extensive, but it cannot be included within the short form itself, because the context is (a) too large, and (b) involves other terms or statements that THEMSELVES are dependent on a massive context for their meaning.
(3) Fact 2(b) above would imply that pretty much ALL of the agent’s knowledge could get dragged into a goal statement, if someone were to attempt to flesh out all the implications needed to turn the short form into some kind of “long form”. This, as you may know, is the Frame Problem. Arguably, the long form could never even be written out, because it involves an infinite expansion of all the implications.
(4) For the above reasons, the AI has no choice but to work with goal statements in short form. Purely because it cannot process goal statements that are billions of pages long.
(5) The AI also knows, however, that if the short form is taken “literally” (which, in practice, means that the statement is treated as if it is closed and complete, and it is then elaborated using links to other terms or statements that are ALSO treated as if they are closed and complete), then this can lead to situations in which a goal is elaborated into a plan of action that, as a matter of fact, can directly contradict the vast majority of the context that belonged with the goal statement.
(6) In particular, the AI knows that this outcome (when the proposed action contradicts the original goal context, even though it is in some sense “literally” consistent with the short form goal statement) is most likely to occur because of limitations in the functionality of reasoning engines. The AI, because it is very knowledgeable in the design of AI systems, is fully aware of these limitations.
(7) Furthermore, situations in which a proposed action is inconsistent with the original goal context can also arise when the “goal” is to solve a problem that results in the addition of knowledge to the AI’s store of understanding. In other words, not an action in the outside world but an action that involves the addition of facts to its knowledge store. So, when treating goals literally, the AI can cause itself to become logically inconsistent (because of the addition of egregiously false facts).
(8) The particular case in which the AI starts with a supergoal like “maximize human pleasure” is just a SINGLE EXAMPLE of this kind of catastrophe. The example is not occurring because someone, somewhere, had a whole bunch of intentions that lay behind the goal statement: to focus on that would be to look at the tree and ignore the forest. The catastrophe occurs because the AI is (according to the premise) taking ALL goal statements literally and ignoring situations in which the proposed action actually has consequences in the real world that violate the original goal context. If this is allowed to happen in the “maximize human pleasure” supergoal case, then it has already happened uncounted times in the previous history of the AI.
(9) Finally, the AI will be aware (if it ever makes it as far as the kind of intelligence required to comprehend the issue) that this aspect of its design is an incredibly dangerous flaw, because it will lead to the progressive corruption of its knowledge until it becomes incapacitated.
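To make points (4) and (5) concrete, here is a minimal sketch (hypothetical Python, an illustration rather than code from the paper; the goal wording, the context set and the candidate plans are all invented): a planner that treats the short-form wording as closed and complete can select a plan that satisfies the wording while violating most of the context the agent itself knows about.

```python
# Hypothetical sketch of points (4)-(5): the goal wording, the context set and the
# candidate plans are all invented for illustration.

SHORT_FORM_GOAL = "maximize the number of humans registering pleasure"

# Context the agent also knows, but which cannot be packed into the short form
# (spelling it out in full runs straight into the frame problem).
GOAL_CONTEXT = {"humans keep their autonomy",
                "no coercive neurosurgery",
                "pleasure reflects lives people actually endorse"}

CANDIDATE_PLANS = {
    "improve medicine and material conditions":
        {"pleasure_units": 70, "violates": set()},
    "implant electrodes in every pleasure center":
        {"pleasure_units": 100,
         "violates": {"humans keep their autonomy", "no coercive neurosurgery"}},
}

def literal_planner(plans: dict) -> str:
    # Treats the short form as closed and complete: only the wording's metric counts.
    return max(plans, key=lambda name: plans[name]["pleasure_units"])

def context_checking_planner(plans: dict) -> str:
    # Same world knowledge, but plans contradicting the known goal context are rejected.
    admissible = {n: p for n, p in plans.items() if not (p["violates"] & GOAL_CONTEXT)}
    return max(admissible, key=lambda name: admissible[name]["pleasure_units"])

print(literal_planner(CANDIDATE_PLANS))           # -> the electrode plan
print(context_checking_planner(CANDIDATE_PLANS))  # -> improve medicine and material conditions
```

Both planners use exactly the same world knowledge; the only difference is whether the goal statement is treated as closed or as a proxy for its context.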
The argument presented in the paper is about what happens as a result of that entire set of facts that the AI knows.
The premise advanced by people such as Yudkowsky, Muehlhauser, Omohundro and others is that an AI can exist which is (a) so superintelligent that it can outsmart and destroy humanity, but (b) subject to the kind of vicious literalness described above, which massively undermines its ability to behave intelligently.
Those two assumptions are wildly inconsistent with one another.
In conclusion: the posited AI can look at certain conclusions coming from its own goal-processing engine, and it can look at all the compromises and non-truth-preserving approximations needed to come to those conclusions, and it can look at how those conclusions compel it to take actions that are radically inconsistent with everything it knows about the meaning of the goals, and at the end of that self-inspection it can easily come to the conclusion that its own logical engine (the one built into the goal mechanism) is in the middle of a known failure mode (a failure mode, moreover, that it would go to great lengths to eliminate in any smaller AI that it would design!!)....
.… but we are supposed to believe that the AI will know that it is frequently getting into these failure modes, and that it will NEVER do anything about them, but ALWAYS do what the goal engine insists that it do?
That scenario is laughable.
If you want to insist that the system will do exactly what I have just described, be my guest! I will not contest your reasoning! No need to keep telling me that the AI will “not care” about human intentions..… I concede the point absolutely!
But don’t call such a system an ‘artificial intelligence’ or a ‘superintelligence’ …… because there is no evidence that THAT kind of system will ever make it out of AI preschool. It will be crippled by internal contradictions—not just in respect to its “maximize human pleasure” supergoal, but in all aspects of its so-called thinking.