(Posting this initial comment without having read the whole thing because I won’t have a chance to come back to it today; apologies if you address this later or if it’s clearly addressed in a comment)
If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving.
It seems worth spelling out your view here on how RSI-capable early AGI is likely to be. I would expect that early AGI will be capable of RSI in the weak sense of being able to do capabilities research and help plan training runs, but not capable of RSI in the strong sense of being able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities.
I think this matters for your scenario, because the weaker form of RSI still requires either a large cluster of commercial GPUs (which seems hard to assemble secretly or privately), or ultra-high-precision manufacturing capabilities, which we know are extremely difficult to achieve with human-level intelligence.
Great point. I definitely mean fully capable of recursive self-improvement (that is, needing no humans in the loop). This pushes the timeline out to at least the point when roughly human-level robotics are commercially available, but I expect that to be ten years or less away.
The hardware requirements for early AGI are another factor in the timeline before this RSI catastrophe is possible. Remember that algorithmic progress has been roughly as fast as hardware progress to date, so hardware requirements will also cease to be a large limitation all too soon.
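A back-of-the-envelope sketch of that compounding, with purely illustrative doubling times (the specific rates below are my assumptions, not claims from this thread):

```python
# Rough illustration only: if hardware price-performance and algorithmic
# efficiency each double on some timescale (the doubling times below are
# assumed, not measured), effective compute per dollar compounds as their
# product, so today's hardware requirements stop binding fairly quickly.

HARDWARE_DOUBLING_YEARS = 1.0  # assumed hardware price-performance doubling time
ALGO_DOUBLING_YEARS = 1.0      # assumed algorithmic-efficiency doubling time

def effective_compute_multiplier(years: float) -> float:
    """Combined multiplier on effective compute after `years` years."""
    hardware_gain = 2 ** (years / HARDWARE_DOUBLING_YEARS)
    algorithmic_gain = 2 ** (years / ALGO_DOUBLING_YEARS)
    return hardware_gain * algorithmic_gain

for years in (1, 3, 5):
    print(f"{years} yr: {effective_compute_multiplier(years):,.0f}x")
# 1 yr: 4x, 3 yr: 64x, 5 yr: 1,024x under these assumed rates.
```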
The problem is that if that scenario isn’t immediately a risk, people may become complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI-capable.
Got it. I think I personally expect a period of at least 2-3 years when we have human-level AI (~‘as good as or better than most humans at most tasks’) but it’s not capable of full RSI.
It also seems plausible to me that strong RSI in the sense I use it above (‘able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities’) may take a long time to develop, or even require already-superhuman levels of intelligence. As a loose demonstration of that possibility: the best team of neurosurgeons etc. in the world couldn’t currently operate on someone’s brain to give them greater intelligence, even if they had tools that let them precisely edit individual neurons and connections. I’m certainly not confident that’s too hard for human-level AI, but it seems plausible.
The problem is that if that scenario isn’t immediately a risk, people may become complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI-capable.
That seems highly plausible to me too; my mainline guess is that by default, given human-level AI, it rapidly proliferates as replacement employees and for other purposes until either there’s a sufficiently large catastrophe, or it improves to superhuman.
capable of RSI in the weak sense of being able to do capabilities research and help plan training runs
The speed at which this kind of thing is possible is crucial, even if capabilities are not above human level. This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn’t require waiting for currently plannable training runs becomes useful, as well as things like rewriting all the software, even if models themselves can’t be “manually” retrained as part of this process. Plausibly at some point in the theoretical research you unlock online learning, even the kind that involves gradually shifting to a different architecture, and the inconvenience of distinct training runs disappears.
So this weak RSI would need to involve either AIs that can’t do autonomous research but can help the researchers or engineers, or AIs that are sufficiently slow and non-superintelligent that they can’t run through decades of research in months.
This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn’t require waiting for currently plannable training runs becomes useful
It doesn’t seem clear to me that this is the case; there isn’t necessarily a faster way to precisely predict the behavior and capabilities of a new model than training it (other than crude measures like ‘loss on next-token prediction continues to decrease as some function of parameter count’).
It does seem possible and even plausible, but I think our theoretical understanding would have to improve enormously in order to make large advances without empirical testing.
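For concreteness, here is a minimal sketch of the kind of ‘crude measure’ mentioned above: a power-law scaling fit in the style of Hoffmann et al. (2022) that extrapolates final loss from parameter and token counts. The constants below are placeholders I’ve assumed for illustration, not fitted values:

```python
# A scaling-law fit of the form L(N, D) = E + A / N**alpha + B / D**beta
# (Hoffmann-et-al.-style). All constants below are assumed placeholders.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 410.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Extrapolated next-token loss for a model of N params trained on D tokens."""
    return e + a / n_params ** alpha + b / n_tokens ** beta

# Example: a hypothetical 70B-parameter model trained on 1.4T tokens.
print(predicted_loss(70e9, 1.4e12))
# The fit says where the loss curve goes, but nothing precise about which
# capabilities appear at that loss, which is exactly the gap noted above.
```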
I mean theoretical research on more general topics, not necessarily directly concerned with any given training run or even with AI. I’m considering the consequences of there being an AI that can do human-level research in math and theoretical CS at much greater speed than humanity. It’s not useful when it’s slow, because the next training run will render what little progress is feasible irrelevant; in the same way, nobody currently trains frontier models for 2 years, since a bigger training cluster will come online in 1 and then outrun the older run. But with sufficient speed, catching up on theory from the distant future can become worthwhile.
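A small worked example of that ‘don’t train for two years’ logic, with an assumed compute growth rate (my number, purely for illustration):

```python
# If cluster throughput grows by a factor g per year, a run of length T
# years started at year t accumulates g**t * T units of compute (in units
# of today's cluster-years) and finishes at year t + T.

def run_compute(start_year: float, length_years: float, growth: float) -> float:
    """Total compute of a training run, in today's cluster-years."""
    return growth ** start_year * length_years

g = 3.0  # assumed yearly growth factor in available cluster size

long_run = run_compute(start_year=0, length_years=2, growth=g)   # 2.0
late_run = run_compute(start_year=1, length_years=1, growth=g)   # 3.0

# Both finish at year 2, but the later, shorter run gets more compute,
# so the 2-year run is never worth starting. The same squeeze applies to
# slow AI-generated theory: it has to pay off before the next run lands.
print(long_run, late_run)
```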
Oh, I see, I was definitely misreading you; thanks for the clarification!