The alignment difficulty scale is based on this post.
I really like this post and think it’s a useful addendum to my own alignment difficulty scale (making it 2D, essentially). But I think I was conceptualizing my scale as running along the diagonal line you provide from GPT-4 to sovereign AI. But I think your way of doing it is better on reflection.
In my original post when I suggested that the ‘target’ level of capability we care about is the capability level needed to build positively transformative AI (pTAI), which is essentially the ‘minimal aligned AGI that can do a pivotal act’ notion but is more agnostic about whether it will be a unitary agentic system or many systems deployed over a period.
I think that what most people talk about when they talk about alignment difficulty isn’t how hard the problem ‘ultimately’ is but rather how hard the problem is that we need to solve, with disagreements also being about e.g. how capable an AI you need for various pivotal/positively transformative acts.
I didn’t split these up because I think that in a lot of people’s minds the two run together in a fairly unprincipled way, but if we want a scale that corresponds to real things in the world having a 2D chart like this is better.
I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.
This model may make having two axes instead of one unnecessary. If capabilities and alignment scale together predictably, then high alignment difficulty is associated with high capabilities, and therefore the capabilities axis could be unnecessary.
But I think there’s value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (e.g. pivotal AGI), there is uncertainty about how hard it would be to align such an AGI because the gradient of the diagonal line intersecting the vertical line is uncertain.
Instead of a single diagonal line, I now think the 2D model describes alignment difficulty in terms of the gradient of the line. An optimistic scenario is one where AI capabilities are scaled and few additional alignment problems arise or existing alignment problems do not become more severe because more capable AIs naturally follow human instructions and learn complex values. A highly optimistic possibility is that increased capabilities and alignment are almost perfectly correlated and arbitrarily capable AIs are no more difficult to align than current systems. Easy worlds correspond to lines in the 2D chart with low gradients and low-gradient lines intersect the vertical line corresponding to the 1D scale at a low point.
A pessimistic scenario can be represented in the chart as a steep line where alignment problems rapidly crop up as capabilities are increased. For example, in such hard worlds, increased capabilities could make deception and self-preservation much more likely to arise in AIs. Problems like goal misgeneralization might persist or worsen even in highly capable systems. Therefore, in hard worlds, AI alignment difficulty increases rapidly with capabilities and increased capabilities do not have helpful side effects such as the formation of natural abstrations that could curtail the increasing difficulty of the AI alignment problem. In hard worlds, since AI capabilities gains cause a rapid increase in alignment difficulty, the only way to ensure that alignment research keeps up with the rapidly increasing difficulty of the alignment problem is to limit progress in AI capabilities.
I really like this post and think it’s a useful addendum to my own alignment difficulty scale (making it 2D, essentially). But I think I was conceptualizing my scale as running along the diagonal line you provide from GPT-4 to sovereign AI. But I think your way of doing it is better on reflection.
In my original post when I suggested that the ‘target’ level of capability we care about is the capability level needed to build positively transformative AI (pTAI), which is essentially the ‘minimal aligned AGI that can do a pivotal act’ notion but is more agnostic about whether it will be a unitary agentic system or many systems deployed over a period.
I think that what most people talk about when they talk about alignment difficulty isn’t how hard the problem ‘ultimately’ is but rather how hard the problem is that we need to solve, with disagreements also being about e.g. how capable an AI you need for various pivotal/positively transformative acts.
I didn’t split these up because I think that in a lot of people’s minds the two run together in a fairly unprincipled way, but if we want a scale that corresponds to real things in the world having a 2D chart like this is better.
I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.
This model may make having two axes instead of one unnecessary. If capabilities and alignment scale together predictably, then high alignment difficulty is associated with high capabilities, and therefore the capabilities axis could be unnecessary.
But I think there’s value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (e.g. pivotal AGI), there is uncertainty about how hard it would be to align such an AGI because the gradient of the diagonal line intersecting the vertical line is uncertain.
Instead of a single diagonal line, I now think the 2D model describes alignment difficulty in terms of the gradient of the line. An optimistic scenario is one where AI capabilities are scaled and few additional alignment problems arise or existing alignment problems do not become more severe because more capable AIs naturally follow human instructions and learn complex values. A highly optimistic possibility is that increased capabilities and alignment are almost perfectly correlated and arbitrarily capable AIs are no more difficult to align than current systems. Easy worlds correspond to lines in the 2D chart with low gradients and low-gradient lines intersect the vertical line corresponding to the 1D scale at a low point.
A pessimistic scenario can be represented in the chart as a steep line where alignment problems rapidly crop up as capabilities are increased. For example, in such hard worlds, increased capabilities could make deception and self-preservation much more likely to arise in AIs. Problems like goal misgeneralization might persist or worsen even in highly capable systems. Therefore, in hard worlds, AI alignment difficulty increases rapidly with capabilities and increased capabilities do not have helpful side effects such as the formation of natural abstrations that could curtail the increasing difficulty of the AI alignment problem. In hard worlds, since AI capabilities gains cause a rapid increase in alignment difficulty, the only way to ensure that alignment research keeps up with the rapidly increasing difficulty of the alignment problem is to limit progress in AI capabilities.