The superintelligent agent must decide whether or not to execute the part of its own code telling it to reward itself for certain outcomes, as well as whether or not to add or remove reward functions. It must realize that its capacity for self-modification gives it the power to alter the physical structure of its goal-device, and must come up with some reason to make these alterations or not to make them.
It sounds like you are running afoul of Ghosts in the Machine, though I’m not entirely sure exactly what you’re saying.
Another way of phrasing this criticism is that the OP is implicitly assuming an already aligned AGI.
Sentences like the following are exactly the kind of reasoning errors that the orthogonality thesis is fighting against.
Likewise, they may have designed the physical and energetic structures that instantiate its goal suboptimally.
Humans are able to detect a difference between representations of their goals and the goals themselves. A superintelligent agent should likewise be able to grasp this distinction.
Another possibility is that it would recognize that the goal-representation it finds in its own structure or code was created by humans, and that its true goal should be to better understand what those humans intended.
It’s also possible to imagine that the agent would modify its own tendency for relentless pursuit of its goal, which again makes it hard to predict the agent’s behavior.
If you relentlessly pursue a goal, you’re not going to do some existential thinking to check whether it is truly the right goal—you’re going to relentlessly pursue the goal! The mere fact of doing all this existential meditation requires that it be part of the goal, which means we have already managed some form of alignment that makes the AI care about its goal being right, for some nice notion of right.
Obviously, if your AI is already aligned with humans and our philosophical takes on the world, its goals won’t be just any possible goal. But if you don’t use circular reasoning by assuming alignment, we have no reason to expect that an unaligned AI will realize its goal isn’t what we meant, just as an erroneous Haskell program doesn’t realize it should compute the factorial some other way than what it was written to do.
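To make the Haskell analogy concrete, here is a minimal, hypothetical sketch of my own (not from Bostrom or the OP): the programmer intends the factorial, writes the wrong recursive case, and the program simply executes the definition it was given.

```haskell
-- Hypothetical illustration: the programmer intended n!, but wrote the
-- wrong recursive case. The program computes what the code says, not
-- what was meant.
badFactorial :: Integer -> Integer
badFactorial 0 = 1
badFactorial n = n + badFactorial (n - 1)  -- "+" where "*" was intended

main :: IO ()
main = print (badFactorial 5)  -- prints 16, not the intended 120
```

Nothing in the program “notices” the gap between its definition and the programmer’s intent; the gap exists only from the programmer’s point of view, which is the point being made about an unaligned optimizer and its specified goal.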
The assumptions Bostrom uses to justify the orthogonality thesis include:
If desire is required in order for beliefs to motivate actions, and if intelligence may produce belief, but not desire.
“… if the agent happens to have certain standing desires of some sufficient, overriding strength.”
“… if it is possible to build a cognitive system (or more neutrally, an “optimization process”) with arbitrarily high intelligence but with constitution so alien as to contain no clear functional analogues to what in humans we call “beliefs” and “desires”.”
“… if an agent could have impeccable instrumental rationality even whilst lacking some other faculty constitutive of rationality proper, or some faculty required for the full comprehension of the objective moral facts.”
First, let’s point out that the first three justifications use the word “desire,” rather than “goal.” So let’s rewrite the orthogonality thesis with this substitution:
Intelligence and final desires are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final desire.
Let’s accept the Humean theory of motivation, and agree that there is a fundamental difference between belief and desire. Nevertheless, if Bostrom is implicitly defining intelligence as “the thing that produces beliefs, but not desires,” then he is begging the question in the orthogonality thesis.
Now, let’s consider the idea of “standing desires of some sufficient, overriding strength.” Though I could easily be missing a place where Bostrom makes this connection, I haven’t found where he goes from proposing the existence of such standing desires to showing why they are compatible with any level of intelligence. By analogy, we can imagine a human with an extremely powerful desire to consume some drug. We cannot take it for granted that some biomedical intervention that allowed them to greatly increase their level of intelligence would leave their desire to consume the drug unaltered.
Bostrom’s AI with an alien constitution, possessing intelligence but not beliefs and desires, again begs the question. It implicitly defines “intelligence” in such a way that it is fundamentally different from a belief or a desire. Later, he refers to “intelligence” as “skill at prediction, planning, and means-ends reasoning in general.” It is hard to imagine how we could have means-ends reasoning without some sort of desire. This seems to me an equivocation.
His last point, that an agent could be superintelligent without having impeccable instrumental rationality in every domain, is also incompatible with the orthogonality thesis as he describes it here. He says that more or less any level of intelligence could be combined with more or less any final desire. When he makes this point, he is saying that more or less any final desire is compatible with superintelligence, as long as we exclude the parts of intelligence that are incompatible with the desire. While we can accept that an AI could be superintelligent while failing to exhibit perfect rationality in every domain, the orthogonality thesis as stated encompasses a superintelligence that is perfectly rational in every domain.
Rejecting this formulation of the orthogonality thesis is not simultaneously a rejection of the claim that superintelligent AI is a threat. It is instead a rejection of the claim that Bostrom has successfully argued that there is a fundamental distinction between intelligence and goals, or between intelligence and desires.
My original argument here was meant to go a little further, and illustrate why I think that there is an intrinsic connection between intelligence and desire, at least at a roughly human level of intelligence.