I don’t know if anyone still reads comments on this post from over a year ago. Here goes nothing.
I am trying to understand the argument(s) as deeply and faithfully as I can. These two sentences from Section B.2 stuck out to me as the most important in the post (from the point of view of my understanding):
...outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
My first question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?
It seems like there is a gulf between “it’s not guaranteed to work” and “it’s almost certain to fail”.
Inner alignment failure is a phenomenon that has happened in existing AI systems, weak as they are. So we know it can happen. We are on track to build many superhuman AI systems. Unless something unexpectedly good happens, eventually we will build one that has a failure of inner alignment. And then it will kill us all. Does the probability of any given system failing inner alignment really matter?
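The "eventually" in this argument is just the arithmetic of repeated trials: even a small per-system failure probability compounds across many deployed systems. A minimal sketch, where the per-system probability p and the number of systems N are illustrative assumptions, not figures from the discussion:

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independently trained systems
    suffers an inner-alignment failure, given per-system probability p.
    Assumes independence across systems, which is itself a simplification."""
    return 1 - (1 - p) ** n

# Hypothetical values: even p = 0.01% becomes near-certain over enough systems.
for p in (0.0001, 0.01):
    for n in (10, 1_000, 100_000):
        print(f"p={p}, N={n}: {p_any_failure(p, n):.4f}")
```

Under these (assumed) numbers, the per-system probability matters enormously for the first system but washes out in the limit of many systems, which is roughly the disagreement in this exchange.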
Yes, because if the first superhuman AGI is aligned, and if it performs a pivotal act to prevent misaligned AGI from being created, then we will avert existential catastrophe.
If there is a 99.99% chance of that happening, then we should be quite sanguine about AI x-risk. On the other hand, if there is only a 0.01% chance, then we should be very worried.
It’s hard to guess, but it happened when the only general intelligence known to us was created by a hill-climbing process.
I think it’s inappropriate to call evolution a “hill-climbing process” in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training.
This seems super important to the argument! Do you know if it’s been discussed in detail anywhere else?