I think most worlds that successfully navigate AGI risk have properties like:
AI results aren’t published publicly, going back to more or less the field’s origin.
The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like ‘giant opaque deep nets’.
This means that you need to figure out what makes an approach ‘alignable’ earlier, which suggests much more research on getting deconfused regarding alignable cognition.
Many such deconfusions will require a lot of software experimentation, but the kind of software/ML that helps you learn a lot about alignment as you work with it is itself a relatively narrow target that you likely need to steer toward deliberately, based on earlier, weaker deconfusion progress. I don’t think having DL systems on hand to play with has helped humanity learn much about alignment thus far, and by default, I don’t expect humanity to get much more clarity on this before AGI kills us.
Researchers focus on trying to predict features of future systems, and trying to get mental clarity about how to align such systems, rather than focusing on ‘align ELIZA’ just because ELIZA is the latest hot new thing. Make and test predictions, back-chain from predictions to ‘things that are useful today’, and pick actions that are aimed at steering — rather than just wandering idly from capabilities fad to capabilities fad.
(Steering will often fail. But you’ll definitely fail if you don’t even try. None of this is easy, but to date humanity hasn’t even made an attempt.)
In this counterfactual world, deductive reasoners and expert systems were only ever considered a set of toy settings for improving our intuitions, never a direct path to AGI.
(I.e., the civilization was probably never that confused about core questions like ‘how much of cognition looks like logical deduction?’; their version of Aristotle or Plato, or at least Descartes, focused on quantitative probabilistic reasoning. It’s an adequacy red flag that our civilization was so confused about so many things going into the 20th century.)
To me, all of this suggests a world where you talk about alignment before you start seeing crazy explosions in capabilities. I don’t know what you mean by “we didn’t even have the concept of machine learning back then”, but I flatly don’t buy that the species that landed on the Moon isn’t capable of generating (a more disjunctive version of) the OP’s semitechnical concerns pre-AlexNet.
You need the norm of ‘be able to discuss things before you have overwhelming empirical evidence’, and you need the skill of ‘be good at reasoning about such things’, in order to solve alignment at all; so it’s a no-brainer that not-wildly-incompetent civilizations at least attempt literally any of this.
“most worlds that successfully navigate AGI risk” is kind of a strange framing to me.
For one thing, it represents p(our world | success), whereas we care about p(success | our world). To convert between the two, you of course need to multiply by p(success) / p(our world). What’s the prior distribution over worlds? This seems underspecified.
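Spelled out, the conversion in question is just Bayes’ rule, with the underspecified prior p(our world) sitting in the denominator:

\[
p(\text{success} \mid \text{our world}) \;=\; p(\text{our world} \mid \text{success}) \cdot \frac{p(\text{success})}{p(\text{our world})}
\]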
For another, using the methodology “think about whether our civilization seems more competent than the problem is hard” or “whether our civilization seems on track to solve the problem”, I might have forecast nuclear annihilation (not sure about this).
The methodology seems to work when we’re relatively certain about the level of difficulty on the mainline, so if I were more sold on that I would believe this more. It would still feel kind of weird though.