but have so far only found relatively incremental improvements to transformers (in the realm of 1000x improvement)
What 1000x improvement? Better hardware and larger scale are not algorithmic improvements. Careful study of scaling laws to get Chinchilla scaling and set tokens per parameter more reasonably[1] is not an algorithmic improvement either. There was maybe a 5x-20x algorithmic improvement, meaning the compute multiplier: how much less compute one would need to get the same perplexity on some test data. The upper bound is speculation, based on published research for which there are no public results of large-scale experiments (including for combinations of multiple methods) and on the absence of very strong compute-multiplier results from developers of open-weights models who publish detailed reports, like DeepSeek and Meta. The lower bound can be observed in the Mamba paper (Figure 4, Transformer vs. Transformer++), though that comparison doesn’t test MoE over a dense transformer (which should be worth a further 2x or so, but I still don’t know of a paper that demonstrates this clearly).
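To make the metric concrete, here is a minimal sketch of how a compute multiplier could be read off from loss-versus-compute fits. This is my own framing rather than a procedure from any particular paper, and the clean power-law form is an assumption:

```python
import numpy as np

def fit_power_law(compute, loss):
    """Toy fit of loss ~= a * compute**(-b) along one training recipe's compute frontier."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def compute_multiplier(baseline_fit, improved_fit, budget):
    """How much more compute the baseline recipe needs to match the improved recipe's
    loss at `budget` FLOPs. A value of 5 would mean a 5x algorithmic improvement."""
    a0, b0 = baseline_fit
    a1, b1 = improved_fit
    target_loss = a1 * budget ** (-b1)                 # loss the improved recipe reaches
    baseline_needed = (a0 / target_loss) ** (1 / b0)   # compute the baseline needs for that loss
    return baseline_needed / budget
```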
The recent Yi-Lightning is an interesting example: it wins on Chatbot Arena in multiple categories over all but a few of the strongest frontier GPT-4-level models (the original GPT-4 itself is far behind). It was trained for about 2e24 FLOPs, 10x less than the original GPT-4, and it’s a small overtrained model, so its tokens per parameter are very unfavorable; that is, it would have been possible to make it even more capable with the same compute.
[1] It’s not just 20 tokens per parameter.
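To see why a small model at that compute is overtrained, here is rough arithmetic using the standard C ≈ 6·N·D approximation for dense training FLOPs. The parameter counts below are made-up illustrations, not Yi-Lightning’s actual size:

```python
# With training compute C ~= 6 * N * D, a fixed 2e24 FLOPs budget spent on a smaller
# model implies far more tokens per parameter than the ~20 of a Chinchilla-optimal run.
C = 2e24                          # training FLOPs (the figure cited above)
for N in (1e10, 3e10, 1e11):      # hypothetical active parameter counts
    D = C / (6 * N)               # implied training tokens
    print(f"N = {N:.0e} params -> D = {D:.1e} tokens, {D / N:.0f} tokens/param")
# N = 1e10 -> ~3.3e13 tokens, ~3300 tokens/param
# N = 3e10 -> ~1.1e13 tokens,  ~370 tokens/param
# N = 1e11 -> ~3.3e12 tokens,   ~33 tokens/param
```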
I think that if you take into account all the improvements to transformers published since their initial invention in the 2010s, there is well over 1000x worth of improvement.
I can list a few of these advancements off the top of my head, but a comprehensive list would be a substantial project to assemble.
Data Selection:
DeepMind JEST: https://arxiv.org/abs/2406.17711
SoftDedup: https://arxiv.org/abs/2407.06654
Activation function improvements, e.g. SwiGLU (see the sketch after this list)
FlashAttention: https://arxiv.org/abs/2205.14135
GrokFast: https://arxiv.org/html/2405.20233v2
AdEMAMix Optimizer https://arxiv.org/html/2409.03137v1
Quantized training
Better parallelism
DPO https://arxiv.org/abs/2305.18290
Hypersphere embedding https://arxiv.org/abs/2410.01131
Binary Tree MoEs: https://arxiv.org/abs/2311.10770 https://arxiv.org/abs/2407.04153
And a bunch of stuff in-the-works that may or may not pan out:
hybrid attention and state-space models (e.g. mixing in some Mamba layers)
multi-token prediction (potentially including diffusion-model guidance): https://arxiv.org/abs/2404.19737 https://arxiv.org/abs/2310.16834
Here’s a survey article with a bunch of further links: https://arxiv.org/abs/2302.01107
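To make one of the items above concrete (the activation-function entry), here is a minimal PyTorch sketch of a SwiGLU feed-forward block. The sizing note at the end is an assumption based on common Llama-style configurations, not a claim about any specific model in this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Minimal SwiGLU MLP block (Shazeer 2020, "GLU Variants Improve Transformer")."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W_gate) acts as a learned gate on x W_up before projecting back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# e.g. SwiGLUFeedForward(d_model=4096, d_hidden=11008) roughly matches Llama-7B sizing,
# where d_hidden is ~2/3 of the usual 4*d_model to keep the parameter count comparable.
```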
But that’s just my response in defense of the point that there has been at least 1000x of improvement. My expectations of substantial improvement yet to come are based not just on this historical pattern, but also on reasoning about a potential ‘innovation overhang’ of valuable knowledge that can be gleaned from interpolating between existing research papers (something LLMs will likely soon be good enough for), and also on reasoning from my neuroscience background and some specific estimates of how various parts of the brain compare, in compute efficiency and learning rates, to models which do equivalent things.
I’m talking about the compute multiplier as a measure of algorithmic improvement: how much less compute it takes to get to the same place. Half of these things are not relevant to it. Maybe another datapoint: Mosaic’s failure with DBRX, when their entire thing was hoarding compute multipliers.
Consider Llama-3-405B, a 4e25 FLOPs model that is just Transformer++ from the Mamba paper I referenced above, not even MoE. A compute multiplier of 1000x over the original transformer would be a 200x multiplier over this Llama (since Transformer++ is already roughly 5x over the original), meaning matching its performance with 2e23 FLOPs (1.5 months of training on 128 H100s). Yi-Lightning is exceptional for its low 2e24 FLOPs compute (10x more than our target), but it feels like a lot of that comes from better post-training; subjectively it doesn’t appear quite as smart, so it would probably lose the perplexity competition.
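As a sanity check on that figure, a back-of-the-envelope sketch; the per-GPU throughput, utilization, and hourly price are my assumptions, not numbers from this thread:

```python
# Back-of-the-envelope check of "2e23 FLOPs ~= 1.5 months on 128 H100s".
target_flops = 2e23
gpus = 128
peak_flops_per_gpu = 1e15   # ~H100 dense BF16 peak, FLOP/s (assumption)
mfu = 0.40                  # assumed model FLOPs utilization

days = target_flops / (gpus * peak_flops_per_gpu * mfu) / 86400
print(f"{days:.0f} days")   # ~45 days, i.e. roughly 1.5 months

# At an assumed ~$2 per H100-hour, that is ~128 * 45 * 24 * $2 ~= $280K, consistent
# with the "$300K to train" figure used later in this thread. The same arithmetic
# gives ~2.4 months for 4e25 FLOPs on 16K H100s, i.e. Llama-3-405B scale.
```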
I thought you might say that some of these weren’t relevant to the metric of compute efficiency you had in mind. I do think that these things are relevant to ‘compute it takes to get to a given capability level’.
Of course, what’s actually even more important than an improvement to training efficiency is an improvement to peak capability. I would argue that if Yi-Lightning, for example, had an architecture with higher peak capability, then the gains from the additional training it was given would have been larger; there wouldn’t have been such diminishing returns to overtraining.
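To illustrate the shape of those diminishing returns (as a generic scaling-law point, not a claim about Yi-Lightning’s actual numbers), here is a sketch using a Chinchilla-style parametric loss with roughly the constants fitted by Hoffmann et al. (2022):

```python
# L(N, D) = E + A / N**alpha + B / D**beta, with roughly the Hoffmann et al. (2022) fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

N = 2e10                          # a hypothetical ~20B-parameter dense model
for D in (4e11, 4e12, 4e13):      # 20, 200, 2000 tokens per parameter
    print(f"{D / N:>5.0f} tokens/param -> loss {loss(N, D):.3f}")
# Each extra 10x of tokens buys a smaller loss reduction, while the A / N**alpha term
# remains a floor that only a larger (or architecturally better) model can push down.
```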
If it were possible to just keep training an existing transformer and have it keep getting smarter at a decent rate, then we’d probably be at AGI already. Just train GPT-4 10x as long.
I think a lot of people are seeing ways in which something about the architecture and/or training regime isn’t quite working for some key aspects of general intelligence, particularly reasoning and hyperpolation.
Some relevant things I have read:
reasoning limitations: https://arxiv.org/abs/2406.06489
hyperpolation: https://arxiv.org/abs/2409.05513
detailed analysis of logical errors made: https://www.youtube.com/watch?v=bpp6Dz8N2zY
Some relevant seeming things I haven’t yet read, where researchers are attempting to analyze or improve LLM reasoning:
https://arxiv.org/abs/2407.02678
https://arxiv.org/html/2406.11698v1
https://arxiv.org/abs/2402.11804
https://arxiv.org/abs/2401.14295
https://arxiv.org/abs/2405.15302
https://openreview.net/forum?id=wUU-7XTL5XO
https://arxiv.org/abs/2406.09308
https://arxiv.org/abs/2404.05221
https://arxiv.org/abs/2405.18512
In practice, there are no 2e23 FLOPs models costing $300K to train that are anywhere close to Llama-3-405B in capability. If leading labs had such models (based on unpublished experimental results and further algorithmic insights), then trained with the 8e25 FLOPs they have to give, rather than the reference 2e23 FLOPs, those models would be much smarter than Llama-3-405B. A better choice of ways of answering questions doesn’t get us far in actual technical capabilities.
(Post-training like o1 is a kind of “better choice of ways of answering questions” that might help, but we don’t know how much compute it saves. Noam Brown gestures at 100,000x from his earlier work, but we haven’t seen Llama 4 yet; it might just spontaneously become capable of coherent long reasoning traces as a result of more scale, the bitter lesson making the Strawberry Team’s efforts moot.)
Many improvements observed at smaller scale disappear at greater scale, or don’t stack with each other. Many papers have horrible methodologies, plausibly born of scarcity of research compute, that don’t even try (or make it possible) to estimate the compute multiplier. Most of them will eventually be forgotten, for good reason. So most papers that seem to demonstrate improvements are not strong evidence for the hypothesis of a 1000x cumulative compute-efficiency improvement, while that hypothesis predicts observations about what should already be possible in practice that we are not seeing, which is strong evidence against it. There are multiple competent teams that don’t have Microsoft compute, and they don’t win over Llama-3-405B, which we know doesn’t have all of these speculative algorithmic improvements and uses 4e25 FLOPs (2.5 months on 16K H100s, rather than 1.5 months on 128 H100s for 2e23 FLOPs).
In other words, the importance of Llama-3-405B for the question about speculative algorithmic improvements is that the detailed report shows it has no secret sauce: it merely competently uses about as much compute as the leading labs, in very conservative ways. And yet it’s close in capabilities to all the other frontier models. Which means the leading labs don’t have significantly effective secret sauce either, which means nobody does, since the leading labs would’ve already borrowed it if it were that effective.
There’s clearly a case in principle for it being possible to learn with much less data, anchoring to humans blind from birth. But there’s probably much more compute happening in a human brain per the proverbial external data token. And a human has the advantage of not learning everything about everything, with greater density of capability over encyclopedic knowledge, which should help save on compute.
I think we mostly agree, but there’s some difference in what we’re measuring against.
I agree that it really doesn’t appear that the leading labs have any secret sauce which is giving them more than 2x improvement over published algorithms.
I think that the Llama 3 family does include a variety of improvements which have come along since “Attention Is All You Need” (Vaswani et al., 2017). Perhaps I am wrong that these improvements add up to a 1000x improvement.
The more interesting question to me is why the big labs seem to have so little ‘secret sauce’ compared to open source knowledge. My guess is that the researchers in the major labs are timidly (pragmatically?) focusing on looking for improvements only in the search space very close to what’s already working. This might be the correct strategy, if you expect that pure scaling will get you to a sufficiently competent research agent to allow you to then very rapidly search a much wider space of possibilities. If you have the choice between digging a ditch by hand, or building a backhoe to dig for you....
Another critical question is whether there are radical improvements which are potentially discoverable by future LLM research agents. I believe that there are. Trying to lay out my arguments for this is a longer discussion.
Some sources which I think give hints about the thinking and focus of big lab researchers:
https://www.youtube.com/watch?v=UTuuTTnjxMQ
https://braininspired.co/podcast/193/
Some sources on ideas which go beyond the nearby idea-space of transformers:
https://www.youtube.com/watch?v=YLiXgPhb8cQ
https://arxiv.org/abs/2408.10205
There should probably be a dialogue between you and @Vladimir_Nesov over how much algorithmic improvements actually do to make AI more powerful, since this might reveal cruxes and help everyone else prepare better for the various AI scenarios.
For what it’s worth, it seems to me that Jack Clark of Anthropic is mostly in agreement with @Vladimir_Nesov about compute being the primary factor. Quoting from Jack’s blog here:
The world’s most capable open weight model is now made in China: …Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class… The world’s best open weight model might now be Chinese—that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated).
Why this matters—competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute—on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to equivalent compute as Western counterparts then this wouldn’t just be a world class open weight model—it might be competitive with the far more experienced proprietary models made by Anthropic, OpenAI, and so on. Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).
Another data point supporting Vladimir and Jack Clark’s view of training compute being the key factor:
https://arxiv.org/html/2407.07890v1
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt (Max Planck Institute for Intelligent Systems)
Abstract
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
This updates me to think that a lot of the emergent behaviors that occurred in LLMs probably had mostly mundane reasons, and most importantly it makes me think LLM capabilities might be more predictable than we think.
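For concreteness, the adjustment the abstract describes amounts to something like the following. This is a minimal sketch with placeholder model names, data, and hyperparameters, not the authors’ code; it assumes the shared task-relevant dataset is already tokenized with labels:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_then_evaluate(model_name, shared_task_data, run_benchmark):
    """Fine-tune a model on the *same* task-relevant data as every other model under
    comparison, then evaluate, so differing amounts of 'training on the test task'
    during pretraining no longer confound the comparison."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    args = TrainingArguments(
        output_dir=f"adjusted-{model_name.split('/')[-1]}",
        num_train_epochs=1,               # identical fine-tuning budget for every model (assumption)
        per_device_train_batch_size=8,
        learning_rate=1e-5,
    )
    Trainer(model=model, args=args, train_dataset=shared_task_data).train()
    return run_benchmark(model, tokenizer)

# scores = {m: finetune_then_evaluate(m, shared_task_data, run_benchmark)
#           for m in ["family-A/model", "family-B/model"]}   # placeholder names
```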