Let’s focus on a simple version, without the metaphors. We’re talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.
So what is happening is this: the AI considers various possible future worlds according to its desirability criteria, these worlds are described to humans according to the AI’s description criteria, and humans choose according to whatever criteria we use. So we have a combination of criteria that results in a final decision. A siren world is a world that ranks very high on these combined criteria but is actually nasty.
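To make the over-optimisation worry concrete, here is a toy sketch (every name and number in it is hypothetical, not anything from the original discussion): “appeal” stands in for how attractive the AI’s description sounds to the human, and it is only loosely correlated with how good the world actually is. As the search widens, the winner of the combined criteria is picked for appeal, and its true value lags further behind; an AI actively optimising its descriptions against human evaluation would widen that gap far more than this random-noise model does.

```python
import random

# Toy model of the combined criteria (every name and number here is made up
# purely for illustration). Each candidate world has a hidden true value and
# an "appeal": how attractive the AI's description of it sounds to the human.
# Appeal is correlated with true value, but only loosely.
random.seed(0)

def sample_world():
    true_value = random.gauss(0, 1)            # how good the world actually is
    spin = random.gauss(0, 1)                  # room for misleading/seductive framing
    appeal = 0.5 * true_value + 0.87 * spin    # what the human choice responds to
    return true_value, appeal

def chosen_world(n_candidates):
    """The AI searches n_candidates worlds; the human picks the one whose
    description sounds best. Selection sees only appeal, never true value."""
    worlds = [sample_world() for _ in range(n_candidates)]
    return max(worlds, key=lambda w: w[1])

for n in (10, 1_000, 100_000):
    value, appeal = chosen_world(n)
    print(f"search over {n:>7} worlds: appeal {appeal:+.2f}, true value {value:+.2f}")
```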
If we stick to that scenario and assume the AI is truthful, the main siren-world generator is the AI’s ability to describe these worlds in ways that sound very attractive to humans. Since human beliefs and preferences are not clearly distinct, this ranges from misleading (creating incorrect human beliefs) to actively seductive (influencing human preferences to favour these worlds).
The higher the bandwidth the AI has, the more chance it has of “seduction”, or of exploiting known or unknown human irrationalities (again, there’s often no clear distinction between exploiting irrationalities for beliefs or preferences).
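A similarly hypothetical sketch of the bandwidth point: if each unit of bandwidth lets the AI choose among more truthful-but-spun framings of the same world, then the best-sounding framing it can find drifts further from a neutral description as bandwidth grows.

```python
import random

# Hypothetical sketch of the bandwidth point. For one fixed world, the AI can
# choose which of `bandwidth` truthful framings to present; each framing's spin
# is modelled as Gaussian noise on top of a neutral description. Cherry-picking
# the best-sounding framing gives a boost that grows with bandwidth.
random.seed(1)

def best_framing_boost(bandwidth):
    return max(random.gauss(0, 1) for _ in range(bandwidth))

for bandwidth in (1, 10, 1_000):
    boosts = [best_framing_boost(bandwidth) for _ in range(200)]
    avg = sum(boosts) / len(boosts)
    print(f"bandwidth {bandwidth:>5}: average appeal boost {avg:+.2f}")
```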
One scenario—Paul Christiano’s—is a bit different but has essentially unlimited bandwidth (or, more precisely, has an AI estimating the result of a setup that has essentially unlimited bandwidth).
…but also have a bunch of other extra-scary features above and beyond other scenarios of people being irrational, just because.
This category can include irrationalities we don’t yet know about, better exploitation of irrationalities we do know about, and a host of speculative scenarios about hacking the human brain, which I don’t want to rule out completely at this stage.
We’re talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.
No. We’re not. That’s dumb. Like, sorry to be spiteful, but that is already a bad move. You do not treat any scenario involving “an AI”, without dissolving the concept, as desirable or realistic. You have “an AI”, without having either removed its “an AI”-ness (in the LW sense of “an AI”) entirely or guaranteed Friendliness? You’re already dead.
Can we assume that, since I’ve been working all this time on AI safety, I’m not an idiot? When presenting a scenario (“assume AI contained, and truthful”) I’m investigating whether we have safety within the terms of that scenario. Here we don’t, so we can reject attempts aimed at that scenario without looking further. If/when we find a safe way to do it within the scenario, then we can investigate whether that scenario is achievable in the first place.
Ah. Then here’s the difference in assumptions: I don’t believe a contained, truthful UFAI is safe in the first place. I just have an incredibly low prior on that. So low, in fact, that I didn’t think anyone would take it seriously enough to imagine scenarios which prove it’s unsafe, because it’s just so bloody obvious that you do not build UFAI for any reason, because it will go wrong in some way you didn’t plan for.
See the point on Paul Christiano’s design. The problem I discussed applies not only to UFAIs but also to other designs that try to get around it while still using potentially unrestricted search.