I do think that these things are relevant to ‘compute it takes to get to a given capability level’.
In practice, there are no 2e23 FLOPs models that cost $300K to train and come anywhere close to Llama-3-405B in capability. If leading labs had such models (based on unpublished experimental results and more algorithmic insights), then trained with the 8e25 FLOPs they actually have to give, rather than the reference 2e23 FLOPs, those models would be much smarter than Llama-3-405B. A better choice of ways of answering questions doesn’t get us far in actual technical capabilities.
(o1-style post-training is a kind of “better choice of ways of answering questions” that might help, but we don’t know how much compute it saves. Noam Brown gestures at 100,000x from his earlier work, but we haven’t seen Llama 4 yet; it might just spontaneously become capable of coherent long reasoning traces as a result of more scale, the bitter lesson making the Strawberry team’s efforts moot.)
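As a rough sanity check on the $300K figure for a 2e23 FLOPs run, here is a minimal sketch; the per-GPU throughput, 40% utilization, and ~$2/H100-hour rental price are my assumptions, not figures from the comment:

```python
# Rough cost estimate for a 2e23 FLOPs training run (all constants are assumptions).
H100_BF16_FLOPS = 989e12    # dense BF16 peak per GPU, FLOP/s (assumed)
MFU = 0.40                  # assumed model FLOPs utilization
PRICE_PER_GPU_HOUR = 2.0    # assumed $/H100-hour rental price

total_flops = 2e23
gpu_hours = total_flops / (H100_BF16_FLOPS * MFU) / 3600
cost = gpu_hours * PRICE_PER_GPU_HOUR
print(f"~{gpu_hours:,.0f} H100-hours, ~${cost:,.0f}")  # about 140,000 H100-hours, ~$280,000
```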
Many improvements observed at smaller scale disappear at greater scale, or don’t stack with each other. Many papers have horrible methodologies, plausibly born of scarcity of research compute, that don’t even try to estimate the compute multiplier (or make it possible to estimate one). Most of them will eventually be forgotten, for good reason. So most papers that seem to demonstrate improvements are not strong evidence for the hypothesis of a 1000x cumulative compute-efficiency improvement, while that hypothesis predicts observations about what should already be possible in practice, observations we are not getting, which is strong evidence against it. There are multiple competent teams that don’t have Microsoft-scale compute, and they don’t beat Llama-3-405B, which we know doesn’t have these speculative algorithmic improvements and uses 4e25 FLOPs (2.5 months on 16K H100s, rather than the 1.5 months on 128 H100s a 2e23 FLOPs run would take).
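The GPU-time parenthetical is consistent with the FLOPs figures under assumed throughput and utilization; the 989 TFLOP/s BF16 peak and 40% MFU in this sketch are my assumptions, not numbers from the comment:

```python
# Convert cluster size and training time into approximate total FLOPs
# (per-GPU peak and utilization are assumptions).
H100_BF16_FLOPS = 989e12            # dense BF16 peak per GPU, FLOP/s (assumed)
MFU = 0.40                          # assumed model FLOPs utilization
SECONDS_PER_MONTH = 30 * 24 * 3600

def training_flops(n_gpus: int, months: float) -> float:
    """Approximate total training compute for a cluster running for `months`."""
    return n_gpus * H100_BF16_FLOPS * MFU * months * SECONDS_PER_MONTH

print(f"{training_flops(16_384, 2.5):.1e}")  # ~4.2e25, close to the ~4e25 FLOPs figure
print(f"{training_flops(128, 1.5):.1e}")     # ~2.0e23, close to the 2e23 FLOPs reference
```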
In other words, the importance of Llama-3-405B for the question of speculative algorithmic improvements is that the detailed report shows it has no secret sauce: it merely uses about as much compute as the leading labs, competently and in very conservative ways. And yet it’s close in capabilities to all the other frontier models. Which means the leading labs don’t have significantly effective secret sauce either, which means nobody does, since the leading labs would have already borrowed it if it were that effective.
There’s clearly a case in principle that learning from much less data is possible, anchoring to humans blind from birth. But there’s probably much more compute happening in a human brain per the proverbial external data token. And a human has the advantage of not learning everything about everything, with a greater density of capability over encyclopedic knowledge, which should help save on compute.
I think we mostly agree, but there’s some difference in what we’re measuring against.
I agree that it really doesn’t appear that the leading labs have any secret sauce that gives them more than a 2x improvement over published algorithms.
I think that the Llama 3 family does include a variety of improvements that have come along since “Attention Is All You Need” (Vaswani et al., 2017). Perhaps I am wrong that these improvements add up to a 1000x improvement.
The more interesting question to me is why the big labs seem to have so little ‘secret sauce’ relative to open-source knowledge. My guess is that researchers at the major labs are timidly (pragmatically?) searching for improvements only in the space very close to what’s already working. This might be the correct strategy if you expect that pure scaling will get you to a sufficiently competent research agent, which would then let you very rapidly search a much wider space of possibilities. If you have the choice between digging a ditch by hand, or building a backhoe to dig for you…
Another critical question is whether there are radical improvements that could be discovered by future LLM research agents. I believe that there are. Laying out my arguments for this is a longer discussion.
Some sources which I think give hints about the thinking and focus of big lab researchers:
https://www.youtube.com/watch?v=UTuuTTnjxMQ
https://braininspired.co/podcast/193/
Some sources on ideas which go beyond the nearby idea-space of transformers:
https://www.youtube.com/watch?v=YLiXgPhb8cQ
https://arxiv.org/abs/2408.10205
There should probably be a dialogue between you and @Vladimir_Nesov about how much algorithmic improvements actually do to make AI more powerful, since this might reveal cruxes and help everyone else prepare better for the various AI scenarios.
For what it’s worth, it seems to me that Jack Clark of Anthropic is mostly in agreement with @Vladimir_Nesov about compute being the primary factor. Quoting from Jack’s blog here:
The world’s most capable open weight model is now made in China: …Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class… The world’s best open weight model might now be Chinese—that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated).
Why this matters—competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute—on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to equivalent compute as Western counterparts then this wouldn’t just be a world class open weight model—it might be competitive with the far more experienced proprietary models made by Anthropic, OpenAI, and so on. Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).
Another data point supporting Vladimir and Jack Clark’s view of training compute being the key factor:
https://arxiv.org/html/2407.07890v1
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt (Max Planck Institute for Intelligent Systems)
Abstract
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
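For concreteness, here is a minimal sketch of the adjustment the abstract describes (fine-tune every model under comparison on the same task-relevant data before evaluating); the function names and callables below are hypothetical placeholders, not the authors’ code:

```python
from typing import Callable, Dict, List

# Hypothetical sketch of "adjusting for training on the test task":
# every model is fine-tuned on the same task-relevant data before evaluation,
# so differences in how much task data each saw during pretraining are reduced.
def adjusted_comparison(
    models: Dict[str, object],
    finetune: Callable[[object, List[str]], object],  # placeholder fine-tuning routine
    evaluate: Callable[[object], float],              # placeholder benchmark evaluation
    task_data: List[str],                             # same task-relevant data for all models
) -> Dict[str, float]:
    return {name: evaluate(finetune(model, task_data)) for name, model in models.items()}
```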
This updates me toward thinking that a lot of the emergent behaviors that occurred in LLMs probably had mostly mundane causes, and most importantly it makes me think LLM capabilities might be more predictable than we think.