I also think the improvements themselves are probably running out.
I disagree, though this is based on some guesswork (and Leopold’s analysis, as a recently-ex-insider). I don’t know exactly how they’re doing it (improvements in training data filtering are probably part of it), but the foundation model companies have all been putting out models with lower inference costs and latencies at the same capability level (OpenAI: GPT-4 Turbo and GPT-4o vs. GPT-4; Anthropic: Claude 3.5 Sonnet vs. the Claude 3 generation; Google: Gemini 1.5 vs. 1). I am assuming that the reason for this performance improvement is that the newer models actually had lower parameter counts (which is supported by some rumored parameter count numbers), and I’m then also assuming that means they took less total compute to train. (The latter assumption would be false for smaller models trained via distillation from a larger model, as some of the smaller Google models almost certainly are, or heavily overtrained by Chinchilla standards, as has recently become popular for models that are not the largest member of a model family.)
Things like the effectiveness of model pruning methods suggest that there are a lot of wasted parameters inside current models, which in turn suggests there is still substantial room for performance improvements. The huge context lengths that foundation model companies now advertise, without huge cost differentials, also suggest something architectural has changed: these are presumably no longer classical transformers with full quadratic-cost attention. Which combination of techniques from the academic literature (or from outside it) this is based on is unclear, but something has clearly improved there.
Algorithmic improvements relevant to my argument are those that happen after long-horizon-task-capable AIs are demonstrated; in particular, it doesn’t matter how much progress is happening now, except as evidence about what happens later.
heavily overtrained by Chinchilla standards
This is necessarily part of it. It involves using more compute, not less, which is natural given that new training environments are coming online, and it doesn’t need any algorithmic improvements at all to produce models that are both cheaper at inference and smarter. You can take a Chinchilla-optimal model, make it 3x smaller, and train it on 9x the data, expending 3x more compute, and get approximately the same result. If you up the compute and data a bit more, the model will become more capable. Some current improvements are probably due to better use of pre-training data, but such gains won’t survive significant further scaling intact. There are also improvements in post-training, but they are even less relevant to my argument, assuming they are not lagging too badly in unlocking the key capability thresholds.
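A back-of-the-envelope check of that 3x-smaller/9x-data trade, using the parametric loss fit published in the Chinchilla paper (the constants below are Hoffmann et al.’s published fit; the 70B-parameter/1.4T-token baseline is illustrative, not a claim about any particular deployed model):

```python
# Sketch of the overtraining trade, using the Chinchilla parametric fit:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published Hoffmann et al. (2022) fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted pre-training loss for a model of n_params on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def compute(n_params, n_tokens):
    """Standard ~6ND approximation of training FLOPs."""
    return 6 * n_params * n_tokens

# Baseline: a roughly Chinchilla-optimal 70B model on 1.4T tokens.
n0, d0 = 70e9, 1.4e12
# Overtrained variant: 3x smaller, trained on 9x the data.
n1, d1 = n0 / 3, d0 * 9

print(compute(n1, d1) / compute(n0, d0))  # 3.0 — costs 3x the training compute
print(loss(n0, d0), loss(n1, d1))         # ~1.94 vs ~1.90 — about the same loss
```

In this fit the overtrained model actually lands slightly below the baseline loss, at 3x the training compute and a third of the inference cost, which is the point: no algorithmic improvement is needed to get a smaller, cheaper-to-serve model of comparable capability.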
Algorithmic improvements relevant to my argument are those that happen after long-horizon-task-capable AIs are demonstrated; in particular, it doesn’t matter how much progress is happening now, except as evidence about what happens later.
My apologies, you’re right, I had misunderstood you, and thus we’ve been talking at cross-purposes. You were discussing
…if research capable TAI can lag behind government-alarming long-horizon task capable AI (that does many jobs and so even Robin Hanson starts paying attention)
while I was instead talking about how likely it was that running out of additional money to invest would slow reaching either of these forms of AGI (which I personally expect to arrive quite close together, as Leopold also assumes) by enough to make more than a year or two’s difference.