The question is if research capable TAI can lag behind government-alarming long-horizon task capable AI (that does many jobs and so even Robin Hanson starts paying attention). These are two different thresholds that might both be called “AGI”, so it’s worth making a careful distinction. Even if it turns out that in practice they coincide and the same system becomes the first to qualify for both, for now we don’t know if that’s the case, and conceptually they are different.
If this lag is sufficient, governments might be able to succeed in locking down enough compute to prevent independent development of research capable TAI for many more years. This includes stopping or even reversing improvements in AI accelerators. If governments only become alarmed once there is a research capable TAI, we get the other possibility, where TAI is developed by everyone very quickly and the opportunity to do it more carefully is lost.
Increasing investment is the crucial consideration in the sense that if research capable TAI is possible with modest investment, then there is no preventing its independent development. But if the necessary investment turns out to be sufficiently outrageous, controlling development of TAI by controlling hardware becomes feasible. Advancements in hardware are easy to control if most governments are alarmed, the supply chains are large, the datacenters are large. And algorithmic improvements have a sufficiently low ceiling to keep what would otherwise be $10 trillion training runs infeasible for independent actors even if done with better methods. The hypothetical I was describing has research capable TAI 2-3 OOMs above the $100 billion necessary for long-horizon task capable AI, which as a barrier for feasibility can survive some algorithmic improvements.
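As a rough arithmetic sketch of the barrier in this hypothetical: take the $100 billion figure, scale it by 2-3 OOMs, and divide by some algorithmic efficiency gain. The 5x gain used here is purely an illustrative assumption (not a forecast), and the function name is made up for this example.

```python
def residual_cost_usd(base_cost_usd: float, ooms_above: float, algo_gain: float) -> float:
    """Cost of the larger training run after a hypothetical algorithmic efficiency gain."""
    return base_cost_usd * 10 ** ooms_above / algo_gain


BASE_COST_USD = 100e9      # long-horizon task capable AI, per the hypothetical above
ASSUMED_ALGO_GAIN = 5.0    # illustrative assumption only

# Research capable TAI placed 2 and 3 OOMs above the base cost.
low = residual_cost_usd(BASE_COST_USD, 2, ASSUMED_ALGO_GAIN)
high = residual_cost_usd(BASE_COST_USD, 3, ASSUMED_ALGO_GAIN)
print(f"{low:.1e} to {high:.1e} USD")
```

Under these assumed numbers the run still costs $2 trillion to $20 trillion, which is the sense in which the barrier can survive some algorithmic improvement.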
I also think the improvements themselves are probably running out. There’s only about 5x improvement in all these years for the dense transformer, a significant improvement from MoE, possibly some improvement from Mixture of Depths. All attention alternatives remain in the ballpark despite having very different architectures. Something significantly non-transformer-like is probably necessary to get more OOMs of algorithmic progress, which is also the case if LLMs can’t be scaled to research capable TAI at all.
(Recent unusually fast improvement in hardware was mostly driven by moving to lower precision, first BF16, then FP8 with H100s, and now Microscaling (FP4, FP6) with Blackwell. This process is also at an end; lower-level hardware improvement will be slower. But this point is irrelevant to the argument, since improvement in hardware available to independent actors can be stopped or reversed by governments, unlike algorithmic improvements.)
I also think the improvements themselves are probably running out.
I disagree, though this is based on some guesswork (and Leopold’s analysis, as a recently-ex-insider). I don’t know exactly how they’re doing it (improvements in training data filtering are probably part of it), but the foundation model companies have all been putting out models with lower inference costs and latencies for the same capability level (OpenAI: GPT-4 Turbo, GPT-4o vs. GPT-4; Anthropic: Claude 3.5 Sonnet vs. the Claude 3 generation; Google: Gemini 1.5 vs. 1). I am assuming that the reason for this performance improvement is that the newer models actually had lower parameter counts (which is supported by some rumored parameter count numbers), and I’m then also assuming that means these had lower total compute to train. (The latter assumption would be false for smaller models trained via distillation from a larger model, as some of the smaller Google models almost certainly are, or for models heavily overtrained by Chinchilla standards, as has recently become popular for models that are not the largest member of a model family.)
Things like the effectiveness of model pruning methods suggest that there are a lot of wasted parameters inside current models, which would suggest there’s still a lot of room for performance improvements. The huge context lengths that foundation model companies are now advertising without huge cost differentials also rather suggest that something architectural has happened there, beyond full-attention, quadratic-cost classical transformers. Which combination of techniques that is based on, from the academic literature or otherwise, is unclear, but clearly something improved there.
Algorithmic improvements relevant to my argument are those that happen after long-horizon task capable AIs are demonstrated, in particular it doesn’t matter how much progress is happening now, other than as evidence about what happens later.
heavily overtrained by Chinchilla standards
This is necessarily part of it. It involves using more compute, not less, which is natural given that new training environments are getting online, and doesn’t need any algorithmic improvements at all to produce models that are both cheaper for inference and smarter. You can take a Chinchilla optimal model, make it 3x smaller and train it on 9x data, expending 3x more compute, and get approximately the same result. If you up the compute and data a bit more, the model will become more capable. Some current improvements are probably due to better use of pre-training data, but these things won’t survive significant further scaling intact. There are also improvements in post-training, but they are even less relevant to my argument, assuming they are not lagging behind too badly in unlocking the key thresholds of capability.
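The compute arithmetic in the 3x-smaller, 9x-data example can be checked with the common approximation that dense transformer training compute is about 6 × parameters × tokens. The baseline below (70B parameters at roughly 20 tokens per parameter) is only an assumed stand-in for "Chinchilla optimal"; the sketch verifies the compute ratio, not the claim that capability stays about the same.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer: C ~ 6 * N * D."""
    return 6.0 * params * tokens


# Assumed baseline: roughly Chinchilla-optimal at ~20 tokens per parameter.
base_params = 70e9
base_tokens = 20 * base_params

# The overtrained variant from the text: 3x smaller, 9x the data.
small_params = base_params / 3
small_tokens = base_tokens * 9

compute_ratio = training_flops(small_params, small_tokens) / training_flops(base_params, base_tokens)
tokens_per_param = small_tokens / small_params
inference_cost_ratio = small_params / base_params  # per-token cost scales roughly with params

print(f"training compute: {compute_ratio:.2f}x")
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"inference cost per token: {inference_cost_ratio:.2f}x")
```

So the smaller model costs about 3x more to train and about a third as much per token to run, with no algorithmic change involved.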
Algorithmic improvements relevant to my argument are those that happen after long-horizon task capable AIs are demonstrated, in particular it doesn’t matter how much progress is happening now, other than as evidence about what happens later
My apologies, you’re right, I had misunderstood you, and thus we’ve been talking at cross-purposes. You were discussing
…if research capable TAI can lag behind government-alarming long-horizon task capable AI (that does many jobs and so even Robin Hanson starts paying attention)
while I was instead talking about how likely it was that running out of additional money to invest would slow the arrival of either of these forms of AGI (which I personally view as being likely to happen quite close together, as Leopold also assumes) by enough to make more than a year-or-two’s difference.