I don’t know if anyone still reads comments on this post from over a year ago. Here goes nothing.
I am trying to understand the argument(s) as deeply and faithfully as I can. These two sentences from Section B.2 stuck out to me as the most important in the post (from the point of view of my understanding):
...outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
My first question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?
It seems like there is a gulf between “it’s not guaranteed to work” and “it’s almost certain to fail”.
Inner alignment failure is a phenomenon that has happened in existing AI systems, weak as they are. So we know it can happen. We are on track to build many superhuman AI systems. Unless something unexpectedly good happens, eventually we will build one that has a failure of inner alignment. And then it will kill us all. Does the probability of any given system failing inner alignment really matter?
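The "eventually" in this argument is just the arithmetic of repeated trials: even a small per-system failure probability compounds across many deployed systems. A minimal sketch, where the per-system probability p and the number of systems N are illustrative assumptions, not figures from the discussion:

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independently trained systems
    suffers an inner-alignment failure, given per-system probability p.
    Assumes independence across systems, which is itself a simplification."""
    return 1 - (1 - p) ** n

# Hypothetical values: even p = 0.01% becomes near-certain over enough systems.
for p in (0.0001, 0.01):
    for n in (10, 1_000, 100_000):
        print(f"p={p}, N={n}: {p_any_failure(p, n):.4f}")
```

Under these (assumed) numbers, the per-system probability matters enormously for the first system but washes out in the limit of many systems, which is roughly the disagreement in this exchange.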
Yes, because if the first superhuman AGI is aligned, and if it performs a pivotal act to prevent misaligned AGI from being created, then we will avert existential catastrophe.
If there is a 99.99% chance of that happening, then we should be quite sanguine about AI x-risk. On the other hand, if there is only a 0.01% chance, then we should be very worried.
It’s hard to guess, but it happened when the only general intelligence known to us was created by a hill-climbing process.
I think it’s inappropriate to call evolution a “hill-climbing process” in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training.
This seems super important to the argument! Do you know if it’s been discussed in detail anywhere else?