In that particular non-failure story, I’m definitely imagining that they aren’t “trying to win the debate” (where “trying” is a very strong word that implies taking over the world to win the debate).
Suppose I’m debating someone about gun control, and they say ‘guns don’t kill people; people kill people’. Here are four different scenarios for how I might respond:
1. Almost as a pure reflex, before I can stop myself, I blurt out ‘That’s bullshit!’ in response. It’s not the best way to win the debate, but heck, I’ve heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)
Or, instead:
2. I remember that there’s a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn’t super sympathetic to my political views; so I’ll have to come up with some argument that’s convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: ‘Guns and people both kill people!’ Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing?
A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, ‘People will kill people regardless of whether guns are present?’ Ugh, wait, that’s exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever…
And now my mind is wandering, thinking about gun suicide, and… come on, focus. ‘Guns don’t kill people. People kill people.’ How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it’s just obvious common sense that giving someone more firepower will increase their ability to kill others; and, for example, it will make it likelier that someone kills someone else in a fit of passion, where they might not have committed murder if they’d been delayed a few minutes.
Oh, hey, I can use that! I like how matter-of-fact that response is. And it will be more persuasive to the judge, because it’s not making any strong or outrageous-sounding claims, or building a big edifice of argument; it’s just making a simple challenge, which then puts the ball in the other side’s court and makes it seem like the burden of proof lies with them now. Anyway, I’m feeling tired after thinking this hard, and I’m running out of time, so let’s just go with that idea...
Or, instead:
3. Wait, why am I focusing so much on the $1000 prize for this TV show? Being on this show is an amazing opportunity: I could make way more than $1000 if I hijack the live broadcast to start promoting my business to the televised audience. Actually, what if I just tried to negotiate a deal with my debate opponent? Or, heck, with the producers...
Or, instead:
4. Sorry, I don’t have time to think about that debate question, I’m busy building a Dyson swarm to harvest the Sun’s energy so that I can make the future awesome. I… really don’t care about the $1000, no, relative to the larger stakes here.
If “trying” is a very strong word that literally implies you have to be trying to take over the world, then only scenario #4 involves me “trying” to win the debate. But I think it makes more sense to say that I’m trying in all four cases (or at least in cases #2, #3, and #4, where I’m displaying some strategy in deciding what to say).
You might then respond that we should try to build AI systems that are “trying” in the weak sense of #2, rather than in sense #3 or #4. But I think Eliezer and Steven’s point is that #2, #3, and #4 are on a continuum, rather than being qualitatively different.
(Even #1 is on the continuum in some respects, since my brain needs to be engaging in smart creative search processes somewhere in order to even generate strategies like ‘get mad in response to X’ or ‘find an angry-sounding thing to say in response when I get mad’.)
#2, #3, and #4 are all cases where I’m performing a search for strategies that will get me what I want, and where I evaluate various candidate responses to see how helpful they look. The difference between these options is in how wide a space of strategies I’m considering, and in how efficiently and intelligently I’m zeroing in on the highest-rated strategies in that space. (Where ‘highest-rated’ is relative to what I want.)
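(To make that framing concrete, here is a minimal toy sketch in Python; it is not a claim about how any actual system is implemented. The candidate pools, the scoring function, and the budgets are all made-up stand-ins. The point is only that #2, #3, and #4 can be read as the same search procedure run with a wider candidate space and a larger search budget.)

```python
import random

def search_for_strategy(candidate_space, score, budget):
    """Toy 'trying': sample candidate strategies and keep the highest-rated one.

    The same procedure covers #2, #3, and #4; they differ only in how wide
    candidate_space is and how much search effort (budget) goes into it.
    """
    best, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = random.choice(candidate_space)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Hypothetical strategy pools, purely for illustration.
debate_rhetoric = ["go concrete", "cite a statistic", "find a pithier slogan"]
out_of_game_moves = debate_rhetoric + ["negotiate with my opponent", "pitch my business on air"]
world_scale_moves = out_of_game_moves + ["build a Dyson swarm first"]

how_much_i_want_it = lambda s: len(s)  # stand-in for 'how well does this get me what I want?'

scenario_2 = search_for_strategy(debate_rhetoric, how_much_i_want_it, budget=5)       # narrow space, modest effort
scenario_3 = search_for_strategy(out_of_game_moves, how_much_i_want_it, budget=50)    # wider space, more effort
scenario_4 = search_for_strategy(world_scale_moves, how_much_i_want_it, budget=5000)  # widest space, most effort
```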
I totally agree those are on a continuum. I don’t think this changes my point? It seems like Eliezer is confident that “reduce x-risk to sub-50%” [EDIT: this originally read “near-zero”] requires being all the way on the far side of that continuum, and I don’t see why that’s required.

(“near-zero” is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing “reduce x-risk to near-zero” with “reduce x-risk to sub-50%”.)

(Done)
If we have some way to limit an AI’s strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act — no smart search for strategies at all. But surely there has to be smart search going on somewhere in the system, or how is it doing a bunch of useful novel scientific work?
If we have some way to limit an AI’s strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
It sounds like you think my position is “here is my plan to save the world and I have a story for how it will work”, whereas my actual view is “here is a story in which humanity is stupid and covers itself in shame by taking on huge amounts of x-risk (e.g. 5%), where we have no strong justification for being confident that we’ll survive, but the empirical situation ends up being such that we survive anyway”.
In this story, I’m not imagining that we limited the strategy space or reduced the search quality. I’m imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn’t develop #4-style “trying” (but did develop #2-style “trying”) before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, leading to a solution that we can be justifiably confident in.
My sense is that Eliezer would say that this story is completely implausible, i.e. this hypothesized empirical situation is ruled out by knowledge that Eliezer has. But I don’t know what knowledge rules this out. (I’m pretty sure it has to do with his intuitions about a Core of General Intelligence, and/or why capabilities generalize faster than alignment, but I don’t know where those intuitions come from, nor do I share them.)
Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act
Idk, I’m also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that’s more efficient than scaled-up reflex-like things).
In this story, I’m not imagining that we limited the strategy space or reduced the search quality. I’m imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn’t develop #4-style “trying” (but did develop #2-style “trying”) before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, leading to a solution that we can be justifiably confident in.
You (a human) already exhibit #2-style trying. Despite this, you are not capable of “establishing a stable governance regime that regulates AI development” or “doing alignment research better than any existing human alignment researchers” (the latter is tautologically true, even).
So it seems reasonable to conclude that this level of “trying” is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as “pivotal”). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in “#2-style trying”. And if you buy Eliezer’s/Nate’s argument that it’s the search itself that’s dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call “oracle AI” (or something), then it’s not a large jump from there to “maybe the search decides ‘killing all humans’ rates highly according to its search criteria”.
But perhaps you’re conceptualizing this whole “trying” thing differently, because you go on to say:
Idk, I’m also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that’s more efficient than scaled-up reflex-like things).
which actually just does not parse in my native ontology. Like, in my ontology “sufficiently scaled-up reflex-like things” stop behaving reflexively. It’s not that you have this abstract label “reflex-like” that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim “these things exist on a continuum” (a claim you seemed to agree with in a parallel comment thread, which simply furthers my confusion).

(I endorse dxu’s entire reply.)
So it seems reasonable to conclude that this level of “trying” is not enough to enact the pivotal acts you described
Stated differently than how I’d say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.
in my model reflexiveness is a property of actions,
Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I’ve seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don’t know if it is actually true.)
See also the first few sections of Argument, intuition, and recursion; in the language of that post I’m thinking of “explicit argument” as “trying”, and “intuition” as “reflex-like”, even though they output the same thing.
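(A toy illustration of that distinction, with entirely hypothetical names: two agents that end up taking the same action, one via a cached reflex-like mapping and one via explicit search over plans. On this ontology the interesting difference is in the cognition that produced the action, not in the action itself.)

```python
from itertools import permutations

# Reflex-like cognition: a cached stimulus -> response mapping; no search happens.
REFLEXES = {"sandwich visible in transparent pipe": "reach in and grab it"}

def reflex_agent(situation):
    return REFLEXES[situation]

# Trying-like cognition: enumerate candidate plans, evaluate them, pick the best.
def planning_agent(situation, actions, evaluate):
    candidate_plans = [list(p) for r in (1, 2) for p in permutations(actions, r)]
    best_plan = max(candidate_plans, key=evaluate)
    return best_plan[0]

actions = ["reach in and grab it", "shake the pipe", "poke it with a stick"]
evaluate = lambda plan: 1.0 if plan[0] == "reach in and grab it" else 0.0  # toy utility

situation = "sandwich visible in transparent pipe"
assert reflex_agent(situation) == planning_agent(situation, actions, evaluate)
# Same output behavior; very different cognition produced it.
```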
Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I’m saying that it’s plausible that there’s a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.
Or, to try to say things a bit more concretely: I find it plausible that there is more scaling to be had from improving the efficiency of the search (e.g. by having better-tuned heuristics and intuitions) than from expanding the domain of possible plans considered by the search. The four styles of trying that Rob mentioned differ along the “domain of possible plans” continuum, but I expect we mostly walk up the other continuum, “efficiency / competence of search within the domain”.
(The resulting world looks more like CAIS (Comprehensive AI Services) than like a singular superintelligence with a decisive strategic advantage.)
(And I’ll reiterate, because I anticipate being misunderstood: this is not a prediction that the world must be this way and thus that we are obviously safe; it is a story that I think is not ruled out by our current understanding, and thus one to which I assign non-trivial probability.)
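(A toy sketch of the two axes being distinguished here, with made-up numbers: one route to more capability is searching a fixed, narrow domain of plans with better-tuned heuristics; the other is widening the domain of plans that get searched at all. Nothing in this sketch settles which axis real systems mostly move along; it only illustrates that they are different knobs.)

```python
import random

random.seed(0)

# Made-up plan pools: a narrow in-domain space, plus 'out-of-domain' plans
# (the #3/#4-flavored ones) that only show up if the domain is widened.
narrow_domain = [("in-domain plan", random.random()) for _ in range(10_000)]
wide_domain = narrow_domain + [("out-of-domain plan", random.random()) for _ in range(10_000)]

true_value = lambda plan: plan[1]

def run_search(domain, budget, heuristic):
    """Shortlist the `budget` most promising candidates per the heuristic, then pick the best."""
    shortlist = sorted(domain, key=heuristic, reverse=True)[:budget]
    return max(shortlist, key=true_value)

weak_heuristic = lambda plan: plan[1] + random.gauss(0, 0.5)     # noisy intuitions about value
strong_heuristic = lambda plan: plan[1] + random.gauss(0, 0.05)  # better-tuned intuitions

# Axis 1: same narrow domain, better heuristics (more efficient search).
baseline = run_search(narrow_domain, budget=20, heuristic=weak_heuristic)
tuned = run_search(narrow_domain, budget=20, heuristic=strong_heuristic)

# Axis 2: same weak heuristics, wider domain (more kinds of plans considered).
widened = run_search(wide_domain, budget=20, heuristic=weak_heuristic)

print(true_value(baseline), true_value(tuned), true_value(widened))
```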
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
Stated differently than how I’d say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.
Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I’ve seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don’t know if it is actually true.)
See also the first few sections of Argument, intuition, and recursion; in the language of that post I’m thinking of “explicit argument” as “trying”, and “intuition” as “reflex-like”, even though they output the same thing.
Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I’m saying that it’s plausible that there’s a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.
Or, to try and say things a bit more concretely, I find it plausible that there is more scaling from improving the efficiency of the search (e.g. by having better tuned heuristics and intuitions), than from expanding the domain of possible plans considered by the search. The 4 styles of trying that Rob mentioned exist on a continuum like “domain of possible plans”, but instead we mostly walk up the continuum of “efficiency / competence of search within the domain”.
(The resulting world looks more like CAIS than like a singular superintelligence with a DSA.)
(And I’ll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.)
I suggested doing this using quantilization.
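(For readers unfamiliar with the proposal: a quantilizer (Taylor, “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization”) does not take the argmax over all strategies; it samples from the top q fraction, by base-distribution probability mass, of strategies ranked by estimated utility. That is one concrete way of deliberately limiting both the strategy space that gets acted on and the strength of the optimization. A minimal sketch follows, with hypothetical actions, base probabilities, and utilities.)

```python
import random

def quantilize(actions, base_probs, utility, q=0.1):
    """Top-q quantilizer sketch: sample from the top q fraction (by base-distribution
    probability mass) of actions, ranked by estimated utility, instead of taking the argmax.

    base_probs is a reference distribution over 'normal' actions; q controls how hard
    we optimize (q = 1 is pure imitation of the base distribution, q -> 0 approaches
    pure maximization).
    """
    ranked = sorted(zip(actions, base_probs), key=lambda ap: utility(ap[0]), reverse=True)
    top, mass = [], 0.0
    for action, prob in ranked:
        top.append((action, prob))
        mass += prob
        if mass >= q:
            break
    picks, weights = zip(*top)
    return random.choices(picks, weights=weights, k=1)[0]

# Purely hypothetical numbers: candidate debate strategies, with base probabilities meant
# to reflect how often an ordinary human debater would use each one.
actions = ["go concrete", "cite a statistic", "pithy slogan", "hijack the broadcast"]
base_probs = [0.40, 0.35, 0.24, 0.01]
utility = {"go concrete": 2.0, "cite a statistic": 1.5,
           "pithy slogan": 1.0, "hijack the broadcast": 10.0}.get

print(quantilize(actions, base_probs, utility, q=0.5))
# A pure maximizer would always pick the extreme, rarely-human strategy; the quantilizer
# almost always picks a high-utility strategy that is also 'normal' under the base distribution.
```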