Promoted to curated: I really enjoyed reading through this sequence. I have some disagreements with it, but overall it’s one of the best plain language introductions to AI safety that I’ve seen, and I expect I will link to this as a good introduction many times in the future. I was also particularly happy with how the sequence bridged and synthesized a number of different perspectives that usually feel in conflict with each other.
Critch recently made the argument (and wrote it in his ARCHES paper, summarized by Rohin here) that “AI safety” is a straightforwardly misleading name because “safety” is a broader category than is being talked about in (for example) this sequence – it includes things like not making self-driving cars crash. (To quote directly: “the term ‘AI safety’ should encompass research on any safety issue arising from the use of AI systems, whether the application or its impact is small or large in scope”.) I wanted to raise the idea here and ask Richard what he thinks about renaming it to something like “AI existential safety from first principles” or “AI as an extinction risk from first principles” or “AI alignment from first principles”.
Yeah, this seems like a reasonable point. But I’m not that much of a fan of the alternatives you suggest. What do you think about “AGI safety”?
Oli suggests that there are no fields with three-word names, and so “AI Existential Risk” is not an option. I think “AI Alignment” is currently the most accurate name for the field that encompasses work like Paul’s and Vanessa’s and Scott/Abram’s and so on. I think “AI Alignment From First Principles” is probably a good name for the sequence.
It seems a definite improvement on the axis of specificity, I do prefer it over the status quo for that reason.
But it doesn’t address the problem of scope-sensitivity. I don’t think this sequence is about preventing medium-sized failures from AGI. It’s about preventing extinction-level risks to our future.
“A First-Principles Explanation of the Extinction-Level Threat of AGI: Introduction”
“The AGI Extinction Threat from First Principles: Introduction”
“AGI Extinction From First Principles: Introduction”
Yeah, I agree that’s a problem. But I don’t think it’s a big problem, because who’s talking about medium-sized risks from AGI?
In particular, the flag I want to plant is something like: “when you’re talking about AGI, it’s going to be So Big that existential safety is the default type of safety to be concerned with.”
Also I think having the big EXTINCTION in the title costs weirdness points, because even within the field people don’t use that word very much. So I’m leaning towards AGI safety.
Well, I have talked about them… :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capability claim, as an AI could cause a catastrophe regardless. For instance, consider a scenario where an AI launches an attack calculated to destroy human civilization. If the AI was successful in destroying humanity or large parts of it, but the AI itself was also destroyed in the process, this would not count as a DSA as originally defined. Yet, it seems hard to deny that this outcome should nonetheless count as a catastrophe.

Because of this, this chapter focuses on situations where an AI achieves (at least) a major strategic advantage (MSA), which we will define as “a level of technological and other advantages sufficient to pose a catastrophic risk to human society.” A catastrophic risk is one that might inflict serious damage to human well-being on a global scale and cause 10 million or more fatalities (Bostrom & Ćirković 2008).
A year later, as we consider this for the 2020 Review, I think figuring out a better name is worth another look.
Another option is “AI Catastrophe from First Principles”
Very good point. Safety in engineering is often summarized as “nothing bad happens”, without anthropomorphic nuance and without “intent”: an engineered system can simply go wrong. It seems “AI safety” often glosses over or ignores such facets. Is it that “AI safety” is cast as looking into the creation of “reasonable A(G)I”?