Is this “sharp left turn” a crux for your overall view, or your high probability of failure?
Naively, it seems to me that if capability gains are systematically gradual, with improvements arriving iteratively, a little at a time, we’re in a much better situation with regard to alignment.
If capability gains are gradual, we can continuously feed training data to our system and keep its alignment in step with its capabilities. As soon as it starts to enter a distributional shift and some of its outputs are (or would be) unaligned, those alignment failures can be corrected immediately. You can keep reinforcing corrigibility as capabilities generalize, so that the system correctly generalizes the corrigibility concept. Similarly, the more gradually capabilities grow, the more reliable oversight schemes will be.
(On the other hand, this doesn’t solve the problem that there’s some capability threshold beyond which the outputs of an AI system are illegible to humans, and we can’t tell whether the outputs are aligned in order to give it corrective training data.
Also, even if one could, in principle, increase capabilities gradually, someone else could throw caution to the wind and turn the capability dial up to 11, and the unilateralist’s curse kills us.)
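To make the gradualist picture above a bit more concrete, here’s a minimal toy sketch of the loop I’m imagining: small capability steps, with corrective training applied as soon as a distributional shift produces (or would produce) unaligned outputs. Everything here (ToySystem, shift_detected, corrective_training) is a made-up placeholder, not any real training setup, and the hard part is of course actually implementing the detection and correction steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToySystem:
    capability: float = 0.0
    aligned: bool = True  # whether alignment is currently in step with capability


def gradual_scaling_loop(
    system: ToySystem,
    capability_target: float,
    step: float,
    shift_detected: Callable[[ToySystem], bool],
    corrective_training: Callable[[ToySystem], ToySystem],
) -> ToySystem:
    """Grow capability a little at a time, correcting alignment failures as
    soon as they appear rather than after one large discontinuous jump."""
    while system.capability < capability_target:
        # Iterative capability gain: each step is small enough to audit.
        system.capability += step

        # If the system has entered a new distribution where its outputs
        # (or would-be outputs) are unaligned, correct immediately with
        # targeted training data, keeping alignment in step with capability.
        if shift_detected(system):
            system.aligned = False
            system = corrective_training(system)
    return system


# Trivial stand-ins for the detection and correction steps, just to run the loop.
if __name__ == "__main__":
    result = gradual_scaling_loop(
        ToySystem(),
        capability_target=10.0,
        step=0.5,
        shift_detected=lambda s: int(s.capability) % 3 == 0,
        corrective_training=lambda s: ToySystem(capability=s.capability, aligned=True),
    )
    print(result)
```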
How much would finding out that there’s not going to be a sharp left turn impact the rest of your model?
Or, suppose we could magically scale up our systems as gradually as you, Nate, would like, slowing down as we start to see super-linear improvement: how much safer is humanity?