Using a superintelligence to optimize some explicit goal, i.e. giving it the ‘wrapper structure’, is an obvious and tractable way to direct it
Is it obvious and tractable? MIRI doesn’t currently seem to think so. Given that, it might be worth considering some alternative possibilities, especially those that don’t involve the creation of a superpowered wrapper mind as a failure state.
and some superintelligences kill everyone anyways
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
Is it obvious and tractable? MIRI doesn’t currently seem to think so.
I disagree: MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence with the wrong value function. It’s just not good.
The arguments for why superintelligences will kill everyone tend to route through those intelligences being or becoming wrapper minds. So if we had an AI architecture that was not a wrapper mind, nor especially likely to become one, that might defuse some of those arguments’ force.
This is the kind of logic I talk about when I say “sound more like attempts to obfuscate the problem than serious proposals designed to verifiably prevent the end of the world”. It’s like the reasoning goes:
Mathematicians keep pointing out the ways superintelligences with explicit goals cause bad outcomes.
If we add complications, such as implicit goals, to the superintelligences, they can’t reason as analytically about them.
Therefore, superintelligences with implicit goals are “safe”.
The analytical reasoning is not the problem. The problem is that humans have a very particular habitat they need for their survival, and if we have a superintelligence running around waving its magic wand, casting random spells at things, we will probably not be able to survive. The problem is also that we need a surefire way of getting this superintelligence to wave its magic wand and prevent the other superintelligences from spawning. “What if we flip random bits in the superintelligence’s code” is not a solution either, for the same reason.
I disagree: MIRI thinks it’s obvious and tractable to predict the “end result” of creating a superintelligence
But they don’t think it’s an obvious and tractable means of “directing” such a superintelligence towards our actual goals, which is what the sentence I was quoting was about.
It’s like the reasoning goes
I didn’t say any of that. I would rather summarize my position as:
Mathematicians keep pointing out the ways superintelligences with explicit goals will lead to bad outcomes. They also claim that any powerful cognitive system will tend to have such goals.
But we seem to have lots of examples of powerful cognitive systems which don’t behave like explicit goal maximizers.
Therefore, perhaps we should try to design superintelligences which are also not explicit goal maximizers. And also re-analyze the conditions under which the mathematicians’ purported theorems hold so we can have a better picture of which cognitive systems will act like explicit goal maximizers under which circumstances.
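To make the terms in this exchange concrete, here is a minimal, purely illustrative Python sketch of the distinction being argued over: an agent whose top-level loop explicitly maximizes a fixed utility function over predicted outcomes (the “wrapper structure”), versus an agent whose behavior comes from a learned policy with no explicit objective anywhere in its decision loop. All names here (WrapperAgent, PolicyAgent, utility, world_model, policy) are hypothetical; this is not anyone’s actual proposal or MIRI’s framing.

```python
from typing import Any, Callable, Iterable


class WrapperAgent:
    """Wrapper structure: a fixed outer loop that maximizes one explicit utility."""

    def __init__(self,
                 utility: Callable[[Any], float],
                 world_model: Callable[[Any, Any], Any]):
        self.utility = utility          # the explicit, static goal
        self.world_model = world_model  # predicts the outcome of taking an action

    def act(self, state: Any, options: Iterable[Any]) -> Any:
        # Every decision is scored against the same explicit objective.
        return max(options, key=lambda a: self.utility(self.world_model(state, a)))


class PolicyAgent:
    """No explicit objective in the loop: behavior is whatever the policy outputs."""

    def __init__(self, policy: Callable[[Any], Any]):
        self.policy = policy  # e.g. a trained network; any goals are implicit in it

    def act(self, state: Any) -> Any:
        # Nothing here consults a utility function over predicted outcomes.
        return self.policy(state)
```

On this framing, the question raised above is under what conditions a system shaped like the second class ends up behaving like the first, which is what re-analyzing the mathematicians’ purported theorems would be trying to pin down.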
I think my real objection is that MIRI kind of agrees with the idea “don’t attempt to make a pure utility maximizer with a static loss function on the first try” and thus has tried to build systems that aren’t pure utility maximizers, like ones that are instead corrigible or have “chill”. They just kinda don’t work so far, and anybody suggesting that they haven’t looked is being a bit silly.
Instead, I wish someone suggesting this would actually concretely describe the properties they hope to gain by removing a value function, as I suspect the real answer is… corrigibility or chill. Saying “oh this pure utility maximizer thing looks really hard, let’s explore the space of all possible agent designs instead” isn’t really helpful: what are you looking to find, and why is it safer?