So my objection to debate (which again I think is similar to Eliezer’s) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don’t think the AIs would be sufficiently capable that they could do anything pivotal.
In that particular non-failure story, I’m definitely imagining that they aren’t “trying to win the debate” (where “trying” is a very strong word that implies taking over the world to win the debate).
I didn’t really get into this with Eliezer but like Richard I’m pretty unclear on why “not trying to win the debate” (with the strong sense of trying) implies “insufficiently capable to be pivotal”. I don’t think humans are “trying” in the strong sense, but we sure seem very capable; it doesn’t seem crazy to imagine that this continues.
If I’m reading Rohin correctly, he was gearing up to argue that the claim
I wasn’t really gearing up to argue anything. For most of this conversation I was in the mode of “what is the argument that convinces Eliezer of near-certain doom (rather than just suggesting it is plausible), because I don’t see it”.
The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.
And yes I agree, that was bad of me to have listed those two things as if they’re the only two options.
I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do something, then that “something” is most likely to be vaguely like “win the debate” or something else with similarly-destructive consequences.
A different issue is whether that “most likely” is 99.9% vs 80% or whatever—that part is not immediately obvious to me.
And yet another question is whether we can push that probability much lower, even towards zero, by not using the most straightforward debate setup, but rather adding things to the setup that are directly targeted at sculpting the AGI’s motivations.
I am not in fact convinced of near-certain doom there—that would be my Consequentialism & Corrigibility post. (I am convinced that we don’t have a good plan right now.)
I agree that we don’t have a plan that we can be justifiably confident in right now.
I don’t see why the “destructive consequences” version is most likely to arise, especially since it doesn’t seem to arise for humans. (In terms of Rob’s continuum, humans seem much closer to #2-style trying.)
Again I don’t have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I’m sufficiently confident in “>50% chance that it’s destructive” that I’ll argue for that. I’ll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.
Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human case, given that we’re a social animal, we can’t be surprised to find that the human brainstem reward function inserts lots of socially-related motivations into us, including things like caring about other humans (which sometimes generalizes to caring about other living creatures) and generally wanting to fit in and follow norms under most circumstances, etc. Whereas other things in the world have no relationship to the innate human brainstem reward function, and predictably, basically no one cares about them, except insofar as they become instrumentally useful for something else we do care about. (There are interesting rare exceptions, like human superstitions.) An example in humans would be the question of whether pebbles on the sidewalk are more often an even number of centimeters apart versus an odd number of centimeters apart.
In the straightforward debate setup, I can’t see any positive reason for the reward function to directly paint a valence, either positive or negative, onto the idea of the AGI taking over the world. So I revert to the default expectation that the AGI will view “I take over the world” in a way that’s analogous to how humans view “the pebbles on the sidewalk are an even number of centimeters apart”—i.e., totally neutral, except insofar as it becomes instrumentally relevant for something else. Meanwhile, the reward signal is directly painting positive valence onto some aspect(s) of winning the debate. It’s hard to say exactly what that aspect will be—in fact I think it will be at least somewhat random. But whatever it is, it seems to me to be >50% likely that the AGI can get more of it by taking over the world. I might get as high as “>80%” or “>90%” before I start shrugging and saying “I don’t really know”.
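To make that concrete with a toy sketch (the plans, probabilities, and reward function below are invented purely for illustration, not a model of any real debate system): even if the reward only “paints valence” onto winning, a model-based planner will assign instrumental value to takeover whenever its world model says takeover raises the probability of the rewarded thing.

```python
# Toy model-based planner. Everything here is a made-up placeholder.
WORLD_MODEL = {
    # plan name:       (P(win the debate), does the plan involve taking over?)
    "argue_normally":   (0.60, False),
    "hack_the_judge":   (0.99, True),
}

def reward(won_debate: bool) -> float:
    # Reward only "sees" winning; takeover gets no valence of its own.
    return 1.0 if won_debate else 0.0

def expected_reward(plan: str) -> float:
    p_win, _takes_over = WORLD_MODEL[plan]
    return p_win * reward(True) + (1 - p_win) * reward(False)

print(max(WORLD_MODEL, key=expected_reward))
# -> "hack_the_judge": preferred purely on instrumental grounds, even though
#    the reward function never mentions taking over the world.
```

The only point of the sketch is that nothing in the reward function has to refer to takeover for the planner to end up preferring it.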
(Then we can start talking about capability windows etc., but I don’t think that was your objection here.)
But I guess I’m sufficiently confident in “>50% chance that it’s destructive” that I’ll argue for that.
Fwiw 50% on doom in the story I told seems plausible to me; maybe I’m at 30% but that’s very unstable. I don’t think we disagree all that much here.
Then we can start talking about capability windows etc., but I don’t think that was your objection here.
Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don’t want something uncomputable) and die immediately.
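(For anyone who doesn’t know the reference: AIXI is Hutter’s definition of a reinforcement learner that is “optimal” precisely because it ignores compute entirely. Roughly, at each step it picks the action maximizing expected future reward under a length-weighted mixture over every environment program consistent with its history:

$$
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[\, r_k + \cdots + r_m \,\big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
$$

where $U$ is a universal Turing machine, $q$ ranges over environment programs, and $\ell(q)$ is the length of $q$. The sum over all programs is what makes it incomputable; AIXI-tl restricts to programs of length at most $l$ and per-step runtime at most $t$, which is computable but still astronomically expensive.)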
In that particular non-failure story, I’m definitely imagining that they aren’t “trying to win the debate” (where “trying” is a very strong word that implies taking over the world to win the debate).
Suppose I’m debating someone about gun control, and they say ‘guns don’t kill people; people kill people’. Here are four different scenarios for how I might respond:
1. Almost as a pure reflex, before I can stop myself, I blurt out ‘That’s bullshit!’ in response. It’s not the best way to win the debate, but heck, I’ve heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)
Or, instead:
2. I remember that there’s a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn’t super sympathetic to my political views; so I’ll have to come up with some argument that’s convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: ‘Guns and people both kill people!’ Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing?
A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, ‘People will kill people regardless of whether guns are present?’ Ugh, wait, that’s exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever…
And now my mind is wandering, thinking about gun suicide, and… come on, focus. ‘Guns don’t kill people. People kill people.’ How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it’s just obvious common sense that giving someone more firepower will increase their ability to kill others; and, for example, it will make it likelier that someone kills someone else in a fit of passion, where they might not have committed murder if they’d been delayed a few minutes.
Oh, hey, I can use that! I like how matter-of-fact that response is. And it will be more persuasive to the judge, because it’s not making any strong or outrageous-sounding claims, or building a big edifice of argument; it’s just making a simple challenge, which then puts the ball in the other side’s court and makes it seem like the burden of proof lies with them now. Anyway, I’m feeling tired after thinking this hard, and I’m running out of time, so let’s just go with that idea...
Or, instead:
3. Wait, why am I focusing so much on the $1000 prize for this TV show? Being on this show is an amazing opportunity: I could make way more than $1000 if I hijack the live broadcast to start promoting my business to the televised audience. Actually, what if I just tried to negotiate a deal with my debate opponent? Or, heck, with the producers...
Or, instead:
4. Sorry, I don’t have time to think about that debate question, I’m busy building a Dyson swarm to harvest the Sun’s energy so that I can make the future awesome. I… really don’t care about the $1000, no, relative to the larger stakes here.
If “trying” is a very strong word that literally implies you have to be trying to take over the world, then only scenario #4 involves me “trying” to win the debate. But I think it makes more sense to say that I’m trying in all four cases (or at least in cases #2, #3, and #4, where I’m displaying some strategy in deciding what to say).
You might then respond that we should try to build AI systems that are “trying” in the weak sense of #2, rather than in sense #3 or #4. But I think Eliezer and Steven’s point is that #2, #3, and #4 are on a continuum, rather than being qualitatively different.
(Even #1 is on the continuum in some respects, since my brain needs to be engaging in smart creative search processes somewhere in order to even generate strategies like ‘get mad in response to X’ or ‘find an angry-sounding thing to say in response when I get mad’.)
#2, #3, and #4 are all cases where I’m performing a search for strategies that will get me what I want, and where I evaluate various candidate responses to see how helpful they look. The difference between these options is in how wide a space of strategies I’m considering, and in how efficiently and intelligently I’m zeroing in on the highest-rated strategies in that space. (Where ‘highest-rated’ is relative to what I want.)
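To make that framing concrete, here is a minimal sketch (the function and parameter names are placeholders, not anyone’s actual proposal): all of #2, #3, and #4 can be read as the same loop, differing only in how wide the candidate space is and how much search effort goes into it.

```python
import random

def choose_response(candidate_space, rate, search_effort, rng=random):
    """Return the best-rated candidate found within a bounded search.

    `rate` stands in for "how helpful does this look, relative to what I
    want?". Moving from #2 toward #4 corresponds to widening
    `candidate_space` (more of the world counts as a possible strategy)
    and/or raising `search_effort` (zeroing in on top-rated strategies
    more efficiently and intelligently).
    """
    best, best_score = None, float("-inf")
    for _ in range(search_effort):
        candidate = rng.choice(candidate_space)
        score = rate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

(One way to phrase the disagreement later in this thread: does scaling mostly widen `candidate_space`, or mostly improve `rate` and `search_effort` within a roughly fixed space?)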
I totally agree those are on a continuum. I don’t think this changes my point? It seems like Eliezer is confident that “reduce x-risk to EDIT: sub-50%” requires being all the way on the far side of that continuum, and I don’t see why that’s required.
(“near-zero” is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing “reduce x-risk to near-zero” with “reduce x-risk to sub-50%”.)
If we have some way to limit an AI’s strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act — no smart search for strategies at all. But surely there has to be smart search going on somewhere in the system, or how is it doing a bunch of useful novel scientific work?
If we have some way to limit an AI’s strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
It sounds like you think my position is “here is my plan to save the world and I have a story for how it will work”, whereas my actual view is “here is a story in which humanity is stupid and covers itself in shame by taking on huge amounts of x-risk (e.g. 5%), where we have no strong justification for being confident that we’ll survive, but the empirical situation ends up being such that we survive anyway”.
In this story, I’m not imagining that we limited the strategy space or reduced the search quality. I’m imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn’t develop #4-style “trying” (but did develop #2-style “trying”) before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, leading to a solution that we can be justifiably confident in.
My sense is that Eliezer would say that this story is completely implausible, i.e. this hypothesized empirical situation is ruled out by knowledge that Eliezer has. But I don’t know what knowledge rules this out. (I’m pretty sure it has to do with his intuitions about a Core of General Intelligence, and/or why capabilities generalize faster than alignment, but I don’t know where those intuitions come from, nor do I share them.)
Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act
Idk, I’m also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that’s more efficient than scaled-up reflex-like things).
In this story, I’m not imagining that we limited the strategy space or reduced the search quality. I’m imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn’t develop #4-style “trying” (but did develop #2-style “trying”) before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, leading to a solution that we can be justifiably confident in.
You (a human) already exhibit #2-style trying. Despite this, you are not capable of “establishing a stable governance regime that regulates AI development” or “doing alignment research better than any existing human alignment researchers” (the latter is tautologically true, even).
So it seems reasonable to conclude that this level of “trying” is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as “pivotal”). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in “#2-style trying”. And if you buy Eliezer’s/Nate’s argument that it’s the search itself that’s dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call “oracle AI” (or something), then it’s not a large jump from there to “maybe the search decides ‘killing all humans’ rates highly according to its search criteria”.
But perhaps you’re conceptualizing this whole “trying” thing differently, because you go on to say:
Idk, I’m also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that’s more efficient than scaled-up reflex-like things).
which actually just does not parse in my native ontology. Like, in my ontology “sufficiently scaled-up reflex-like things” stop behaving reflexively. It’s not that you have this abstract label “reflex-like”, that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim “these things exist on a continuum” (which claim it seems like you agreed with in a parallel comment thread, which simply furthers my confusion).
(I endorse dxu’s entire reply.)
So it seems reasonable to conclude that this level of “trying” is not enough to enact the pivotal acts you described
Stated differently than how I’d say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.
in my model reflexiveness is a property of actions,
Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I’ve seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don’t know if it is actually true.)
See also the first few sections of Argument, intuition, and recursion; in the language of that post I’m thinking of “explicit argument” as “trying”, and “intuition” as “reflex-like”, even though they output the same thing.
Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I’m saying that it’s plausible that there’s a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.
Or, to try to say things a bit more concretely: I find it plausible that more of the scaling comes from improving the efficiency of the search (e.g. by having better-tuned heuristics and intuitions) than from expanding the domain of possible plans considered by the search. The four styles of trying that Rob mentioned sit along a continuum of “domain of possible plans”; on this view, we instead mostly walk up the continuum of “efficiency / competence of search within the domain”.
(The resulting world looks more like CAIS than like a singular superintelligence with a DSA.)
(And I’ll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.)
If that’s the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)
I suggested doing this using quantilization.
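(For concreteness: a quantilizer, in roughly Taylor’s sense, replaces the argmax over actions with sampling from the top q-fraction of some trusted base distribution, ranked by estimated utility. That is one concrete knob for limiting how hard the search optimizes. A minimal sketch, where the action set, base distribution, and utility function are all placeholders:)

```python
import random

def quantilize(actions, base_weight, utility, q, rng=random):
    """Sample from the top q-fraction (by base measure) of actions, ranked by utility.

    q = 1 reduces to sampling straight from the base distribution; q -> 0
    approaches pure argmax. Intermediate q bounds how aggressively the
    system optimizes.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    total = sum(base_weight(a) for a in ranked)
    kept, mass = [], 0.0
    for a in ranked:
        kept.append(a)
        mass += base_weight(a) / total
        if mass >= q:
            break
    weights = [base_weight(a) for a in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical usage: prefer debate responses predicted to win, but only ones
# that aren't too unusual relative to a base distribution of "normal" responses.
# response = quantilize(candidates, base_weight=human_likelihood,
#                       utility=predicted_win_prob, q=0.05)
```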