One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to pose an x-risk, and alignment progress needs to reach the level required to align such a system before it comes into existence. I find it informative to think of this as a race, where the finish line for capabilities is x-risk-capable AGI, and the finish line for alignment is the ability to align x-risk-capable AGI. In this model, alignment being ahead when capabilities crosses its finish line is necessary but not sufficient for good outcomes: if alignment doesn’t get there first, we automatically lose, but even if it does, alignment still has to keep improving in proportion to capabilities, or we might fail at some later point. However, I think it’s plausible we’re not even on track for the necessary condition, so I’ll focus on that in this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there’s a worryingly large chance that we just won’t have the alignment progress needed at the critical juncture.
I also think it’s plausible that at some point before x-risk becomes possible, capabilities will advance to the point that the majority of AI research is done by AI systems. The worry is that after this point, capabilities and alignment will benefit similarly from automation, and if alignment is behind when this happens, the lag will be “locked in”, because alignment would need an asymmetric benefit from automation to overtake capabilities once capabilities is already ahead.
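To make the lock-in dynamic concrete, here is a minimal toy sketch in Python. Every number in it (headcounts, multiplier, timing) is invented purely for illustration; the only point is the qualitative behaviour when automation multiplies both fields’ output by the same factor.

```python
# Toy sketch of the race dynamic: if automation multiplies the effectiveness of
# capabilities and alignment research by the same factor, the absolute gap
# between them keeps growing. All numbers are invented for illustration only.

def race(cap_researchers=10_000, align_researchers=300,
         automation_multiplier=10, automation_year=5, total_years=10):
    cap_progress = 0.0
    align_progress = 0.0
    for year in range(total_years):
        # Before automation both fields run at 1x; afterwards both get the same boost.
        m = automation_multiplier if year >= automation_year else 1
        cap_progress += cap_researchers * m
        align_progress += align_researchers * m
        print(f"year {year}: capabilities={cap_progress:>11,.0f}  "
              f"alignment={align_progress:>9,.0f}  "
              f"gap={cap_progress - align_progress:>11,.0f}")

race()
```

Under these made-up assumptions the ratio between the two fields never changes, but the absolute gap grows much faster once automation arrives; closing it would require the multiplier on alignment to be asymmetrically larger.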
There are a number of ways this model could be violated:
Capabilities could turn out to be less accelerated by AI assistance than alignment. It seems like capabilities research is mostly a matter of throwing more hardware at the problem and scaling up, whereas alignment is much more conceptually oriented.
After research is mostly or fully automated, orgs could simply allocate more automated-research time to alignment than to AGI.
Alignment (or coordination to slow down) could turn out to be easy. It could turn out that applying the same amount of effort to alignment and AGI results in alignment being solved first.
However, I don’t think these violations are likely, for the following respective reasons:
It’s plausible that our current reliance on scaling is a product of our theory not being good enough, and that it’s already possible to build AGI with current hardware if you have the textbook from the future. Even if the strong version of that claim isn’t true, one big reason the bitter lesson holds is that bespoke engineering is currently expensive; if it suddenly became a lot cheaper, we would see a lot more of it, and consequently more squeezing of performance out of the same hardware. It also seems likely that before total automation, there will be a number of years where automation is best modelled as a multiplicative factor on human researcher effectiveness. In that case, because of the sheer number of capabilities researchers compared to alignment researchers, alignment researchers would have to benefit far more just to break even (a rough back-of-the-envelope version of this is sketched after these responses).
If orgs were going to pivot, I would expect them to already be allocating a lot more to alignment than they currently do. It’s still plausible that orgs haven’t allocated more to alignment because they think AGI is far away, and that a world with automated research is a world where orgs would suddenly realize how close AGI is and pivot, but that hypothesis hasn’t been very predictive so far. Further, because I expect the tech for research automation to be developed at roughly the same time by many different orgs, it’s not enough for one org to prioritize alignment: a majority, weighted by automated-research capacity, has to. To me this seems difficult, although more tractable than the other coordination problem (coordinating to slow down), because there’s less of a unilateralist problem. The unilateralist problem still exists to some extent: orgs that prioritize alignment are inherently at a disadvantage compared to orgs that don’t, because capabilities progress feeds recursively into faster capabilities progress, whereas alignment progress is less effective at making future alignment progress faster. However, on the relevant timescales this may become less important.
I think alignment is a very difficult problem, and moreover that by its nature it’s incredibly easy to underestimate. I should probably write a full post on my take on this at some point, and I don’t have the space to really dive into it here, but a quick meta-level argument for why we shouldn’t lean on alignment being easy, even if there is a non-negligible chance that it is: a) given the stakes, we should exercise extreme caution, and b) there are very few problems in the same reference class as alignment, and the few that are even close, like computer security, don’t inspire a lot of confidence.
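Returning to the first of these responses, here is the rough back-of-the-envelope version of the “break even” point promised above. It assumes one particular reading of breaking even, namely that the yearly progress gap stops widening, and the headcounts and multiplier are again invented purely for illustration.

```python
# Back-of-the-envelope reading of "break even": the yearly progress gap between
# capabilities and alignment stops widening. Headcounts and multipliers are
# invented purely for illustration.
cap_researchers = 10_000
align_researchers = 300
cap_multiplier = 10  # assumed effectiveness boost that automation gives capabilities

# The gap stops widening when yearly outputs are equal:
#   align_researchers * align_multiplier == cap_researchers * cap_multiplier
required_align_multiplier = cap_researchers * cap_multiplier / align_researchers
print(f"alignment would need roughly a {required_align_multiplier:.0f}x boost, "
      f"versus {cap_multiplier}x for capabilities")
```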
I’m not very confident in this model, and I think exploring the potential violations further is a fruitful direction.