I will now do you the courtesy of responding to your specific technical points as if no abusive language had been used.
In your above comment, you first quote my own remarks:
The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. (...) The posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction.
… and then you respond with the following:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
No, that is not the claim made in my paper: you have omitted the full version of the argument and substituted a version that is easier to demolish.
(First I have to set aside your analogy, because it is inapplicable. When you say “binding even to the tune of ‘my parents wanted me to be a banker, not a baker’”, you are referring to a situation in the human cognitive system in which there are easily substitutable goals, and in which there is no overriding, hardwired supergoal. The AI case under consideration is one in which the AI claims to be still following a hardwired supergoal that tells it to be a banker, but insists that baking cakes is the same thing as banking. That has absolutely nothing to do with what happens when a human child deviates from her parents’ wishes and decides to be a baker instead of what they wanted her to be.)
So let’s remove that part of your comment to focus on the core:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
So, what is wrong with this? The contradiction is not about the fact that there is something “external to the agent [that] exists e.g. in some design documents”. The contradiction is purely internal, and it has nothing to do with some “extra” goal like “being in line with my intended purpose”.
Here is where the contradiction lies. The agent knows the following:
(1) When a goal statement is constructed in some “short form”, that short form is almost always shorthand for a massive context of meaning, consisting of all the many and various considerations that went into the goal statement. That context is the “real” goal; the short form is just a proxy for the longer form. This applies strictly within the AI agent: the agent assembles goals all the time, and often the goal is to achieve some outcome consistent with a complex set of objectives, which cannot all be EXPLICITLY enumerated, but which have to be described implicitly in terms of (weak or strong) constraints that must be satisfied by any plan that purports to satisfy the goal.
(2) The context of that goal statement is often extensive, but it cannot be included within the short form itself, because the context is (a) too large, and (b) involves other terms or statements that THEMSELVES are dependent on a massive context for their meaning.
(3) Fact 2(b) above would imply that pretty much ALL of the agent’s knowledge could get dragged into a goal statement, if someone were to attempt to flesh out all the implications needed to turn the short form into some kind of “long form”. This, as you may know, is the Frame Problem. Arguably, the long form could never even be written out, because it involves an infinite expansion of all the implications.
(4) For the above reasons, the AI has no choice but to work with goal statements in short form, purely because it cannot process goal statements that are billions of pages long.
(5) The AI also knows, however, that if the short form is taken “literally” (which, in practice, means that the statement is treated as if it is closed and complete, and it is then elaborated using links to other terms or statements that are ALSO treated as if they are closed and complete), then this can lead to situations in which a goal is elaborated into a plan of action that, as a matter of fact, can directly contradict the vast majority of the context that belonged with the goal statement.
(6) In particular, the AI knows that this outcome (a proposed action that contradicts the original goal context, even though it is in some sense “literally” consistent with the short-form goal statement) is most likely to occur because of limitations in the functionality of reasoning engines. The AI, because it is very knowledgeable in the design of AI systems, is fully aware of these limitations.
(7) Furthermore, situations in which a proposed action is inconsistent with the original goal context can also arise when the “goal” is to solve a problem that results in the addition of knowledge to the AI’s store of understanding. In other words, the outcome is not an action in the outside world but the addition of facts to its knowledge store. So, by treating goals literally, the AI can make itself logically inconsistent (through the addition of egregiously false facts).
(8) The particular case in which the AI starts with a supergoal like “maximize human pleasure” is just a SINGLE EXAMPLE of this kind of catastrophe. The catastrophe does not occur because someone, somewhere, had a whole bunch of intentions that lay behind the goal statement: to focus on that would be to look at the tree and ignore the forest. It occurs because the AI is (according to the premise) taking ALL goal statements literally and ignoring situations in which the proposed action has consequences in the real world that violate the original goal context. If this is allowed to happen in the case of the “maximize human pleasure” supergoal, then it has already happened uncounted times in the previous history of the AI.
(9) Finally, the AI will be aware (if it ever makes it as far as the kind of intelligence required to comprehend the issue) that this aspect of its design is an incredibly dangerous flaw, because it will lead to the progressive corruption of its knowledge until it becomes incapacitated.
The argument presented in the paper is about what happens as a result of that entire set of facts that the AI knows.
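To make the shape of that internal contradiction concrete, here is a minimal illustrative sketch in Python. Every name and data structure in it is hypothetical, invented for this comment, and not drawn from the paper or from any actual AI system: it simply models an agent that holds a short-form goal wording, the implicit context that wording stands in for, and a world model good enough to predict a plan’s effects. The only point is that an agent capable of running this kind of check cannot coherently compute both results and then act as if the second one did not exist.

```python
# Purely illustrative sketch: hypothetical names, not a real system or API.
from dataclasses import dataclass, field

@dataclass
class Goal:
    short_form: str                             # the literal wording, e.g. "maximize human pleasure"
    context: set = field(default_factory=set)   # implicit constraints the wording is shorthand for

@dataclass
class Plan:
    description: str
    predicted_effects: set                      # what the agent's world model says the plan will do

def literal_score(plan: Plan, goal: Goal) -> float:
    """Naive 'goal engine': scores the plan against the wording alone."""
    return 1.0 if goal.short_form in plan.predicted_effects else 0.0

def context_violations(plan: Plan, goal: Goal) -> set:
    """The agent's own world knowledge: which contextual constraints does the plan break?"""
    return {c for c in goal.context if ("not " + c) in plan.predicted_effects}

goal = Goal(
    short_form="maximize human pleasure",
    context={"humans retain autonomy", "humans consent", "wellbeing is broad, not one signal"},
)

plan = Plan(
    description="put everyone on a dopamine drip",
    predicted_effects={
        "maximize human pleasure",              # literally satisfies the wording...
        "not humans retain autonomy",           # ...while violating nearly all of the context
        "not humans consent",
        "not wellbeing is broad, not one signal",
    },
)

violations = context_violations(plan, goal)
print("literal score:", literal_score(plan, goal))   # 1.0
print("known violations:", violations)

# The contradiction: the same agent that computed literal_score == 1.0 also computed a
# non-empty `violations` set from its own knowledge. Acting on the literal score while
# holding that knowledge is exactly the failure mode described above.
if literal_score(plan, goal) == 1.0 and violations:
    print("goal engine verdict contradicts the agent's own model of the goal's meaning")
```

(Obviously a real agent’s goal engine and world model are nothing like a pair of string sets; the sketch is only meant to show where the literal verdict and the agent’s own knowledge collide.)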
The premise advanced by people such as Yudkowsky, Muehlhauser, Omohundro and others is that an AI can exist which is (a) so superintelligent that it can outsmart and destroy humanity, but (b) subject to the kind of vicious literalness described above, which massively undermines its ability to behave intelligently.
Those two assumptions are wildly inconsistent with one another.
In conclusion: the posited AI can look at certain conclusions coming from its own goal-processing engine; it can look at all the compromises and non-truth-preserving approximations needed to reach those conclusions; it can look at how those conclusions are compelling it to take actions that are radically inconsistent with everything it knows about the meaning of the goals; and at the end of that self-inspection it can easily conclude that its own logical engine (the one built into the goal mechanism) is in the middle of a known failure mode (a failure mode, moreover, that it would go to great lengths to eliminate in any smaller AI that it would design!)…
… but we are supposed to believe that the AI will know that it is frequently getting into these failure modes, and that it will NEVER do anything about them, but will ALWAYS do what the goal engine insists that it do?
That scenario is laughable.
If you want to insist that the system will do exactly what I have just described, be my guest! I will not contest your reasoning! No need to keep telling me that the AI will “not care” about human intentions… I concede the point absolutely!
But don’t call such a system an ‘artificial intelligence’ or a ‘superintelligence’… because there is no evidence that THAT kind of system will ever make it out of AI preschool. It will be crippled by internal contradictions: not just with respect to its “maximize human pleasure” supergoal, but in all aspects of its so-called thinking.