I’m not sure we’re worrying about the same regimes.
The regime I’m most worried about is:
AI systems which are much smarter than the smartest humans
...
It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
...
The AI Optimists don’t make this argument AFAICT, but I think optimism about effectively utilizing “human level” models should transfer to a considerable amount of optimism about smarter-than-human models, due to the potential for using these “human level” systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at qualitatively “human level” capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the “default” trajectory, because we’ll want to run extremely large numbers of these systems at high speeds.)
The idea of using “human level” models like this has a bunch of important caveats which mean you shouldn’t end up extremely optimistic overall, IMO[2]:
It’s not clear that “human level” will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I’m moderately optimistic here.)
Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety, which should make us somewhat more optimistic. But it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans, we need the equivalent of having the 300 best safety researchers work for 500 years, and we won’t have enough acceleration and delay to manage this (see the rough back-of-envelope sketch after this list). (In practice I’m somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
Will “human level” systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment etc.). This objection holds even if we aren’t worried about catastrophic misalignment risk.
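To make the “is massive effective acceleration enough?” caveat concrete, here is a rough back-of-envelope sketch. Every number in it (fleet size, speedup, productivity discount) is an illustrative assumption rather than a figure from the thread; the only quantity taken from the comment above is the hypothetical “300 best safety researchers for 500 years” bar.

```python
# Rough back-of-envelope: can accelerated AI labor meet a large safety-work bar?
# Every parameter below is an illustrative assumption, not a figure from the thread,
# except the hypothetical "300 researchers for 500 years" requirement.

required_researcher_years = 300 * 500  # hypothetical bar from the comment above

def ai_researcher_years(num_copies, speedup, delay_years, productivity_discount):
    """Effective top-researcher-years produced by an AI fleet during a delay.

    num_copies: parallel "human level" AI researcher-equivalents
    speedup: serial thinking speed relative to a human researcher
    delay_years: calendar time bought at the critical point
    productivity_discount: fraction of a top human researcher's output per
        effective year (oversight overhead, losses in fuzzy domains, etc.)
    """
    return num_copies * speedup * delay_years * productivity_discount

for copies, speedup, delay in [(10_000, 5, 1), (100_000, 10, 2), (1_000_000, 10, 3)]:
    produced = ai_researcher_years(copies, speedup, delay, productivity_discount=0.1)
    verdict = "meets" if produced >= required_researcher_years else "misses"
    print(f"{copies:>9,} copies x {speedup:>2}x speed x {delay} yr "
          f"-> {produced:>12,.0f} researcher-years ({verdict} the bar)")
```

Under these made-up parameters the bar is only met with a fairly large fleet and a multi-year delay, which is roughly the shape of the uncertainty this caveat points at.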
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly, by default, very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime.
This isn’t true. It could be that producing an arbitrarily scalable solution to alignment takes X cognitive resources while, in practice, building an uncontrollably powerful AI takes Y cognitive resources, with X < Y.
(Also, this plan doesn’t require necessarily aligning “human level” AIs, just being able to get work out of them with sufficiently high productivity and low danger.)
I’m being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn’t obviously help with that problem. I guess there is some refactor-vs.-rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting the first AGIs to work on alignment and global security in a post-AGI world before other AGIs overshadow such work. The former has near-term, concrete difficulties; the latter has nebulous difficulties that don’t as readily jump to attention. The whole problem is messiness and lack of coordination, so starting from scratch with AGIs seems more promising than reforming human society. But without strong coordination on the development and deployment of the first AGIs, the situation with the activities of AGIs is going to be just as messy and uncoordinated, only unfolding much faster, and that’s not even counting the risk of getting a superintelligence right away.
I’m on the optimists discord and I do make the above argument explicitly in this presentation (e.g. slide 4): Reasons for optimism about superalignment (though, fwiw, I don’t know if I’d go all the way down to 1% p(doom); I have probably updated from something like 10% to <5%, and most of my uncertainty now comes from the governance / misuse side).
On your points ‘Is massive effective acceleration enough?’ and ‘Will “human level” systems be sufficiently controlled to get enough useful work?’, I think that conditional on aligned-enough ~human-level automated alignment RAs, the answers are very likely yes, because it should be possible to get a very large amount of work out of those systems even in a very brief amount of time, e.g. a couple of months (feasible with a coordinated pause, or even with a sufficient lead); see the rough sketch below. See also slides 9 and 10 of the above presentation (and I’ll note that this argument isn’t new; it’s been made in various similar forms by e.g. Ajeya Cotra, Lukas Finnveden, and Jacob Steinhardt).
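As a hedged illustration of the “very large amount of work in a couple of months” claim: the window length, serial speedup, and fleet size below are illustrative assumptions, not numbers from the presentation or the thread, but they show how a short calendar window can compress a lot of research time once ~human-level systems run faster than humans and in parallel.

```python
# Illustrative only: how much research time fits into a short calendar window?
# The window length, speedup, and fleet size are assumptions made for this sketch.

calendar_months = 2          # e.g. time bought by a coordinated pause or a lead
serial_speedup = 30          # thinking speed of one instance vs. one human researcher
parallel_instances = 50_000  # ~human-level automated alignment RAs run in parallel

serial_years_per_instance = (calendar_months / 12) * serial_speedup
total_researcher_years = serial_years_per_instance * parallel_instances

print(f"Each instance does ~{serial_years_per_instance:.1f} years of serial thinking.")
print(f"The fleet does ~{total_researcher_years:,.0f} researcher-years in total,")
print("ignoring coordination overhead and how well the research parallelizes.")
```

Whether the research actually parallelizes that well, and how much oversight overhead eats into the total, is exactly the kind of caveat raised earlier in the thread.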
I’m generally reasonably optimistic about using human level-ish systems to do a ton of useful work while simultaneously avoiding most risk from these systems. But, I think this requires substantial effort and won’t clearly go well by default.
[1] At least R&D which isn’t very limited by physical processes.
[2] I think <1% doom seems too optimistic without more of a story for how we’re going to handle superhuman models.