I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don’t take these too seriously:
~90%: if you keep scaling up RL in complex environments ad infinitum, eventually you get deceptive alignment.
~80%: conditional on RL in complex environments being the first path to transformative AI, there will be deceptively aligned RL models.
~70%: if you keep scaling up GPT-style language modeling ad infinitum, eventually you get deceptive alignment.
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
For the optimization pressure question, I really don’t know, but I think “2x, 4x” seems too low—that corresponds to only 1-2 bits. It would be pretty surprising to me if the absolute separation between the deceptive and non-deceptive models was that small in either direction for almost any training setup.
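To spell out the bit-counting here, a minimal sketch, assuming "optimization pressure" is being measured as the log of the multiplicative selection factor (the usual log2 convention; the numbers below are just illustrations):

```python
import math

def factor_to_bits(factor: float) -> float:
    """Bits of optimization pressure implied by a multiplicative selection factor,
    assuming bits are measured as log2 of the factor."""
    return math.log2(factor)

# "2x, 4x" really is only 1-2 bits; even a much larger factor is still only a handful of bits.
for factor in (2, 4, 1024):
    print(f"{factor}x -> {factor_to_bits(factor):g} bits")
# 2x -> 1 bits, 4x -> 2 bits, 1024x -> 10 bits
```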
Thank you for putting numbers on it!
Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone, in contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century? Or do you mean that, conditional on there being an existential catastrophe due to AI, there is a 60% chance it will be caused by deceptive alignment and a 40% chance it will be caused by other problems like misuse or outer alignment?
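To make the two readings concrete, a quick sketch; the 10% figure below is only the "commonly used" number mentioned above, used as a stand-in rather than anyone's actual estimate:

```python
# Two readings of "~60%: existential catastrophe due to deceptive alignment".
p_ai_catastrophe = 0.10   # stand-in for the "commonly used" total AI x-risk figure

# Reading 1: the 60% is already an unconditional probability.
p_unconditional = 0.60

# Reading 2: the 60% is conditional on an AI-caused existential catastrophe,
# so the implied unconditional number is the product of the two.
p_given_catastrophe = 0.60
p_implied_unconditional = p_given_catastrophe * p_ai_catastrophe

print(f"reading 1 (unconditional):         {p_unconditional:.2f}")          # 0.60
print(f"reading 2 (implied unconditional): {p_implied_unconditional:.2f}")  # 0.06
```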
Amongst the LW crowd I'm relatively optimistic, but I'm not that optimistic. I would give maybe a 20% total risk from misalignment this century. (I'm generally expecting a singularity this century with >75% probability, such that most of the alignment risk there will ever be falls within this century.)
The number is lower if you consider “how much alignment risk before AI systems are in the driver’s seat,” which I think is very often the more relevant question, but I’d still put it at 10-20%. At various points in the past my point estimates have ranged from 5% up to 25%.
And then on top of that there are significant other risks from the transition to AI. Maybe more like 40% total existential risk from AI this century? With extinction risk more like half of that, and more uncertain since I've thought less about it.
I still find a 60% risk from deceptive alignment quite implausible, but I wanted to clarify that a 10% total risk is not in line with my view, and I suspect it is not a typical view on LW or the Alignment Forum.
40% total existential risk, and extinction risk half of that? Does that mean the other half is some kind of existential catastrophe / bad values lock-in but where humans do survive?
Fwiw, I would put non-extinction existential risk at ~80% of all existential risk from AI. So maybe my extinction numbers are actually not too different from Paul's (it seems like we're both at ~20% on extinction specifically).
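Rough arithmetic behind that comparison; note that pairing this comment's 80% non-extinction share with the ~80% total-x-risk figure given at the end of the thread is an assumption for illustration, since the comment itself does not state a total:

```python
# Paul's stated numbers: ~40% total existential risk from AI, extinction ~half of that.
paul_total_xrisk = 0.40
paul_extinction = 0.5 * paul_total_xrisk      # ~0.20

# This comment: non-extinction outcomes are ~80% of all AI x-risk, i.e. extinction ~20% of it.
# The ~80% total below is the figure given later in the thread, assumed here for illustration.
total_xrisk = 0.80
extinction = (1 - 0.80) * total_xrisk         # ~0.16

print(f"Paul, unconditional extinction risk:         ~{paul_extinction:.2f}")
print(f"This comment, unconditional extinction risk: ~{extinction:.2f}")
# Both land at roughly 20%, which is the comparison being made above.
```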
And then there's me, who was so certain until now that any time people talk about x-risk they mean it as synonymous with extinction. It does make me curious, though: what kinds of scenarios are you imagining in which misalignment doesn't kill everyone? Do more people place higher credence on s-risks than I originally suspected?
Unconditional: the ~60% is not conditional on there being an existential catastrophe from AI. I'm also rather more pessimistic than an overall 10% chance; I usually give ~80% chance of existential risk from AI.