Rob Bensinger comments on AGI Ruin: A List of Lethalities

Rob Bensinger 6 Jun 2022 10:37 UTC
LW: 4 AF: 3
0
AF
I don’t think I personally could have written it; if others think they could have, I’d genuinely be interested to hear them brag, even if they can’t prove it.
Maybe the ideal would be ‘I generated the core ideas of [a,b,c] with little or no argument from others; I had to be convinced of [d,e,f] but I now agree with them; I disagree with [g,h,i]; I think you left out important considerations [x,y,z].’ Just knowing people’s self-model is interesting to me, I don’t demand that everything you believe be immediately provable to me.
- johnswentworth 6 Jun 2022 15:38 UTC
  LW: 70 AF: 22
  24
  AF
  I think as of early this year (like, January/February, before I saw a version of this doc) I could have produced a pretty similar list to this one. I definitely would not derive it from the empty string in the closest world-without-Eliezer; I’m unsure how much I’d pay attention to AI alignment at all in that world. I’d very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, but not something I’d have paid attention to on my own without someone pointing them out.
  Some specifics about the kind-of-doc I could have written early this year:
  - The framing around pivotal acts specifically was new-to-me when the late 2021 MIRI conversations were published. Prior to that, I’d have had to talk about how weak wish-granters are safe but not actually useful, and if we want safe AI which actually grants big wishes then we have to deal with the safety problems. Pivotal acts framing simplifies that part of the argument a lot by directly establishing a particular “big” capability which is necessary.
  - By early this year, I think I would have generated pretty similar points to basically everything in the post if I were trying to be really comprehensive. (In practice, writing a post like this, I would go for more unifying structure and thought-generators rather than comprehensiveness; I’d use the individual failure modes more as examples of their respective generators.)
  - In my traversal-order of barriers, the hard conceptual barriers for which we currently have no solution even in principle (like e.g. 16-19) would get a lot more weight and detail; I spend less time thinking about what-I-mentally-categorize-as “the obvious things which go wrong with stupid approaches” (20, 21, 25-36).
    Just within the past week, this post on interpretability was one which would probably turn into a point on my equivalent of Eliezer’s list.
  - The earlier points are largely facts-about-the-world (e.g. 1, 2, 7-9, 12-15). For many of these, I would cite different evidence, although the conclusions remain the same. True facts are, as a general rule, overdetermined by evidence; there are many paths to them, and I didn’t always follow the same paths Eliezer does here.
  - A few points I think are wrong (notably 18, 22, 24 to a limited extent), but are correct relative to the knowledge/models which most proposals actually leverage. The loopholes there are things which you do need pretty major foundational/conceptual work to actually steer through.
  - I would definitely have generated some similar rants at the end, though of course not identical.
    One example: just yesterday I was complaining about how people seem to generate alignment proposals via a process of (1) come up with neat idea, (2) come up with some conditions under which that idea would maybe work (or at least not obviously fail in any of the ways the person knows to look for), (3) either claim that “we just don’t know” whether the conditions hold (without any serious effort to look for evidence), or directly look for evidence that they hold. Pretty standard bottom line failure.
  I did briefly consider writing something along these lines after Eliezer made a similar comment to 39 in the Late 2021 MIRI Conversations. But as Kokotajlo guessed, I did not think that was even remotely close to the highest-value use of my time. It would probably take me a full month’s work to do it right, and the list just isn’t as valuable as my last month of progress. Or the month before that. Or the month before that.
  - Ruby 6 Jun 2022 20:51 UTC
    LW: 10 AF: 2
    3
    AF
    I’m curious about why you decided it wasn’t worth your time.
    
    Going from the post itself, the case for publishing it goes something like “the whole field of AI Alignment is failing to produce useful work because people aren’t engaging with what’s actually hard about the problem and are ignoring all the ways their proposals are doomed; perhaps yelling at them via this post might change some of that.”
    
    Accepting the premises (which I’m inclined to), trying to get the entire field to correct course seems actually pretty valuable, maybe even worth a month of your time, now that I think about it.
    - johnswentworth 6 Jun 2022 21:43 UTC
      LW: 39 AF: 15
      9
      AF
      First and foremost, I have been making extraordinarily rapid progress in the last few months, though most of that is not yet publicly visible.
      Second, a large part of why people pour effort into not-very-useful work is that the not-very-useful work is tractable. Useless, but at least you can make progress on the useless thing! Few people really want to work on problems which are actually Hard, so people will inevitably find excuses to do easy things instead. As Eliezer himself complains, writing the list just kicks the can down the road; six months later people will have a new set of bad ideas with giant gaping holes in them. The real goal is to either:
      - produce people who will identify the holes in their own schemes, repeatedly, until they converge to work on things which are actually useful despite being Hard, or
      - get enough of a paradigm in place that people can make legible progress on actually-useful things without doing anything Hard.
      I have recently started testing out methods for the former, but it’s the sort of thing which starts out with lots of tests on individuals or small groups to see what works. The latter, of course, is largely what my technical research is aimed at in the medium term.
      (I also note that there will always be at least some need for people doing the Hard things, even once a paradigm is established.)
      In the short term, if people want to identify the holes in their own schemes and converge to work on actually useful things, I think the “builder/breaker” methodology that Paul uses in the ELK doc is currently a good starting point.
    - Thane Ruthenis 7 Jun 2022 4:54 UTC
      8 points
      1
      Well, it’s the Law of Continued Failure, as Eliezer termed it himself, no? There have already been a lot of rants about the real problems of alignment and how basically no-one focuses on them, most of them Eliezer-written as well. The sort of person who wasn’t convinced/course-corrected by previous scattered rants isn’t going to be course-corrected by a giant post compiling all the rants in one place. Someone to whom this post would be of use is someone who’s already absorbed all the information contained in it from other sources; someone who can already write it up on their own.
      The picture may not be quite as grim as that, but yeah I can see how writing it would not be anyone’s top priority.
  - lc 6 Jun 2022 17:36 UTC
    −7 points
    −2
    I definitely would not derive it from the empty string in the closest world-without-Eliezer; I’m unsure how much I’d pay attention to AI alignment at all in that world. I’d very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, but not something I’d have paid attention to on my own without someone pointing them out.
    I don’t think he does this; that’d be ridiculous.
    “I can’t find any good alignment researchers. The only way I know how to find them is by explaining that the field is important, using arguments for AI risk and doomerism, which means they didn’t come up with those arguments on their own, and thus cannot be ‘worthy’.”
    - Rob Bensinger 6 Jun 2022 21:20 UTC
      3 points
      0
      I don’t think he does this; that’d be ridiculous.
      Doesn’t do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer hadn’t done that, there would still presumably be someone prior to him who did, since conclusions and ideas have to enter the world somehow. So I’m not understanding what you’re modeling as ridiculous.
      (I don’t know that foom falls into the same category; did Vinge’s or I.J. Good’s arguments help persuade EY here?)
      “I can’t find any good alignment researchers. The only way I know how to find them is by explaining that the field is important, using arguments for AI risk and doomerism, which means they didn’t come up with those arguments on their own, and thus cannot be ‘worthy’.”
      This is phrased in a way that’s meant to make the standard sound unfair or impossible. But it seems like a perfectly fine Bayesian update:
      There’s no logical necessity that we live in a world that lacks dozens of independent “Eliezers” who all come up with this stuff and write about it. I think Nick Bostrom had some AI risk worries independently of Eliezer, so gets at least partial credit on this dimension. Others who had thoughts along these lines independently include Norbert Wiener and I.J. Good (timeline with more examples).
      You could imagine a world that has much more independent discovery on this topic, or one where all the basic concepts of AI risk were being widely discussed and analyzed back in the 1960s. It’s a fair Bayesian update to note that we don’t live in worlds that are anything like that, even if it’s not a fair test of individual ability for people who, say, encountered all of Eliezer’s writing as soon as they even learned about the concept of AI.
      (I could also imagine a world where more of the independent discoveries result in serious research programs being launched, rather than just resulting in someone writing a science fiction story and then moving on with a shrug!)
      Your summary leaves out that “coming up with stuff without needing to be argued into it” is a matter of degree, and that there are many important claims here beyond just ‘AI risk is worth paying attention to at all’.
      It’s logically possible to live in a world where people need to have AI risk brought to their attention, but then they immediately “get it” when they hear the two-sentence version, rather than needing an essay-length or seven-essay-length explanation. To the extent we live in a world where many key players need the full essay, and many other smart, important people don’t even “get it” after hours of conversation (e.g., LeCun), that’s a negative update about humanity’s odds of success.
      Similarly, it’s logically possible to live in a world where people needed persuading to accept the core ‘AI risk’ thing, but then they have an easy time generating all the other important details and subclaims themselves. “Maximum doom” and “minimum doom” aren’t the only options; the exact level of doominess matters a lot.
      E.g., my Eliezer-model thinks that nearly all public discussion of ‘practical implications of logical decision theory’ outside of MIRI (e.g., discussion of humans trying to acausally trade with superintelligences) has been utterly awful. If instead this discourse had managed to get a ton of stuff right even though EY wasn’t releasing much of his own detailed thoughts about acausal trade, then that would have been an important positive update.
      Eliezer spent years alluding to his AI risk concerns on Overcoming Bias without writing them all up, and deliberately withheld many related arguments for years (including as recently as last year) in order to test whether anyone else would generate them independently. It isn’t the case that humanity had to passively wait to hear the full argument from Eliezer before it was permitted for them to start thinking and writing about this stuff.
      - riceissa 6 Jun 2022 22:05 UTC
        15 points
        11
        Doesn’t do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer didn’t do that, there would still presumably be someone prior to him who did that, since conclusions and ideas have to enter the world somehow. So I’m not understanding what you’re modeling as ridiculous.
        My understanding of the history is that Eliezer did not realize the importance of alignment at first, and that he only did so later after arguing about it online with people like Nick Bostrom. See e.g. this thread. I don’t know enough of the history here, but it also seems logically possible that Bostrom could have, say, only realized the importance of alignment after conversing with other people who also didn’t realize the importance of alignment. In that case, there might be a “bubble” of humans who together satisfy the null string criterion, but no single human who does.
        The null string criterion does seem a bit silly nowadays since I think the people who would have satisfied it would have sooner read about AI risk on e.g. LessWrong. So they wouldn’t even have the chance to live to age ~21 to see if they spontaneously invent the ideas.
      - lc 6 Jun 2022 22:01 UTC
        4 points
        2
        Look, maybe you’re right. But I’m not good at complicated reasoning; I can’t confidently verify these results you’re giving me. My brain is using a much simpler heuristic that says: look at all of these other fields with core insights that could have been made way earlier than they were. Look at Newton! Look at Darwin! Certainly game theorists could have come along a lot sooner. But that doesn’t mean the founders of these fields are the only ones Great enough to make progress, so, what are you saying, exactly?
- Steven Byrnes 7 Jun 2022 3:53 UTC
  LW: 39 AF: 12
  23
  AF
  I have a couple of object-level disagreements, including the relevance of evolution / nature of the inner alignment problem and the difficulty of attaining corrigibility. But leaving those aside, I wouldn’t have exactly written this kind of document myself, because I’m not quite sure what the purpose is. It seems to be trying to do a lot of different things for different audiences, where I think more narrowly-tailored documents would be better.
  So, here are four useful things to do, and whether I’m personally doing them:
  First, there is a mass of people who think AGI risk is trivial and stupid and p(doom) ≈ 0, and they can gleefully race to build AGI, or do other things that will speed the development of AGI (like improve PyTorch, or study the neocortex), and they can totally ignore the field of AGI safety, and when they have AGI algorithms they can mess around with them without a care in the world.
  It would be very good to convince those people that AGI control is a serious and hard and currently-unsolved (and interesting!) problem, and that p(doom) will remain high (say, >>10%) unless and until we solve it.
  I think this is a specific audience that warrants a narrowly-tailored document, e.g. avoiding jargon and addressing the basics very well.
  That’s a big part of what I was going for in this post, for example. (And more generally, that whole series.)
  Second, there are people who are thoughtful and well-informed about AGI risk in general, but not sold on the “pivotal act” idea. If they had an AGI, they would do things that pattern-match to “cautious scientists doing very careful experiments in a dangerous domain”, but they would not do things that pattern-match to “aggressively and urgently use their new tool to prevent the imminent end of the world, by any means necessary, even if it’s super-illegal and aggressive and somewhat dangerous and everyone will hate them”.
  (I’m using “pivotal act” in a slightly broader sense that also includes “giving a human-level AGI autonomy to undergo recursive self-improvement and invent and deploy its own new technology”, since the latter has the same sort of dangerous properties and aggressive feel about it as a proper “pivotal act”.)
  (Well, it’s possible that there are people sold on the “pivotal act” idea who wouldn’t say it publicly.)
  Last week I did a little exercise of trying to guess p(doom), conditional on the two assumptions in this other comment. I got well over 99%, but I noted with interest that only a minority of my p(doom) was coming from “no one knows how to keep an AGI under control” (which I’m less pessimistic about than Eliezer, heck maybe I’m even as high as 20% that we can keep an AGI under control :-P , and I’m hoping that further research will increase that), whereas a majority of my p(doom) was coming from “there will be cautious responsible actors who will follow the rules and be modest and not do pivotal acts, and there will also be some reckless actors who will create out-of-control omnicidal AGIs”.
  So it seems extremely important to figure out whether a “pivotal act” is in fact necessary for a good future. And if it is (a big “if”!), then it likewise seems extremely important to get relevant decisionmaking people on board with that.
  I think it would be valuable to have a document narrowly tailored to this topic, finding the cruxes and arguments and counter-arguments etc. For example, I think this is a topic that looks very different in a Paul-Christiano-style future (gradual multipolar takeoff, near-misses, “corrigible AI assistants”, “strategy stealing assumption”, etc.) than in the world that I expect (decisive first-mover advantage).
  But I don’t really feel qualified to write anything like that myself, at least not before talking to lots of people, and it also might be the kind of thing that’s better as a conversation than a blog post.
  Third, there are people (e.g. leadership at OpenAI & DeepMind) making decisions that trade off between “AGI is invented soon” versus “AGI is invented by us people who are at least trying to avoid catastrophe and be altruistic”. Insofar as I think they’re making bad tradeoffs, I would like to convince them of that.
  Again, it would be useful to have a document narrowly tailored to this topic. I’m not planning to write one, but perhaps I’m sorta addressing it indirectly when I share my idiosyncratic models of exactly what technical work I think needs to be done before we can align an AGI.
  Fourth, there are people who have engaged with the AGI alignment / safety literature / discourse but are pursuing directions that will not solve the problem. It would be very valuable to spread common knowledge that those approaches are doomed. But if I were going to do that, it would (again) be a separate narrowly-tailored document, perhaps either organized by challenge that the approaches are not up to the task of solving, or organized by research program that I’m criticizing, naming names. I have dabbled in this kind of thing (example), but don’t have any immediate plan to do it more, let alone systematically. I think that would be extremely time-consuming.
- evhub 8 Jun 2022 22:56 UTC
  LW: 37 AF: 14
  8
  AF
  It’s very clear to me I could have written this if I had wanted to—and at the very least I’m sure Paul could have as well. As evidence: it took me ~1 hour to list off all the existing sources that cover every one of these points in my comment.
  - Eliezer Yudkowsky 9 Jun 2022 0:23 UTC
    LW: 22 AF: 4
    5
    AF
    Well, there’s obviously a lot of points missing! And from the amount this post was upvoted, it’s clear that people saw the half-assed current form as valuable.
    Why don’t you start listing out all the missing further points, then? (Bonus points for any that don’t trace back to my own invention, though I realize a lot of people may not realize how much of this stuff traces back to my own invention.)
    - evhub 9 Jun 2022 0:38 UTC
      LW: 4 AF: 4
      2
      AF
      I’m not sure what you mean by missing points? I only included your technical claims, not your sociological ones, if that’s what you mean.
      - ESRogs 9 Jun 2022 1:25 UTC
        13 points
        7
        I think he means that there are more points that could be made. (If the points in the post are the training set, can you also produce the points in the held-out test set?)
- lc 6 Jun 2022 17:30 UTC
  22 points
  8
  I don’t think I personally could have written it; if others think they could have, I’d genuinely be interested to hear them brag, even if they can’t prove it.
  Maybe I’m beyond hopeless: I don’t even understand the brag inherent in having written it. He keeps talking about coming up with this stuff “from the null string”, but… Isn’t 90% of this post published somewhere else? If someone else had written it wouldn’t he just accuse them of not being able to write it without reading {X}, or something from someone else who read {X}? At present this is mostly a test of recall.
  Edit: Not to say I could’ve done even that, just that I expect someone else could have.
  - Yitz 7 Jun 2022 6:05 UTC
    5 points
    4
    The post honestly slightly decreases my confidence in EY’s social assessment capabilities. (I say slightly because of past criticism I’ve had along similar lines.) [Note here that being good/bad at social assessment is not necessarily correlated with being good/bad at other domains, so I don’t see that as taking away from his extremely valid criticism of common “simple solutions” to alignment (which I’ve definitely been guilty of myself). Please don’t read this as denigrating Eliezer’s general intellect or work as a whole.] As you said, the post doesn’t seem incredibly original, and even if it is and we’re both totally missing that aspect, the fact that we’re missing it implies it isn’t getting across the intended message as effectively as it could. Ultimately, I think if I were in Eliezer’s position, there are a very large number of alternative explanations I’d give higher probability to than assuming that there is nobody in the world as competent as I am.
- swift_spiral 7 Jun 2022 1:04 UTC
  6 points
  2
  When you say you don’t think you could have written it, do you mean that you couldn’t have written it without all the things you’ve learned from talking to Yudkowsky, or that you couldn’t have written it even now? Most of this list consists of things I’ve seen Yudkowsky write before, so if it’s the latter, that surprises me.
- niplav 7 Jun 2022 23:06 UTC
  1 point
  0
  Can I claim a very small but non-zero amount of bragging rights for having written this? It was at the time the ~only text about BCIs and alignment.
  
  I don’t think I could have written the above text in a world where zero people worried about alignment. I also did not bother to write anything more about it because it looked to me that everything relevant was already written up on the Arbital alignment domain.