I think we mostly agree, but there’s some difference in what we’re measuring against.
I agree that the leading labs don’t really appear to have any secret sauce giving them more than a 2x improvement over published algorithms.
I do think the Llama 3 family includes a variety of improvements which have come along since “Attention Is All You Need” (Vaswani et al., 2017) — for example RoPE, SwiGLU, grouped-query attention, and RMSNorm. Perhaps I am wrong that these improvements add up to a 1000x improvement.
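As a rough illustration of why I might be wrong: even several worthwhile post-2017 improvements compound to far less than 1000x. The multipliers below are hypothetical, purely for arithmetic, not measured efficiency gains:

```python
# Hypothetical per-technique efficiency multipliers (illustrative only,
# NOT measured values for any real model).
gains = {
    "better data curation": 2.0,
    "architecture tweaks (RoPE, SwiGLU, GQA)": 1.5,
    "improved training recipe": 1.5,
    "compute-optimal scaling of tokens vs. params": 2.0,
}

total = 1.0
for factor in gains.values():
    total *= factor  # efficiency gains compound multiplicatively

print(total)  # 9.0 — four modest compounding gains still fall well short of 1000x
```

Reaching 1000x this way would take roughly ten independent 2x improvements, which is part of why the size of the cumulative gain is genuinely uncertain.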
The more interesting question to me is why the big labs seem to have so little ‘secret sauce’ relative to published, open-source knowledge. My guess is that researchers at the major labs are timidly (pragmatically?) searching for improvements only in the space very close to what already works. This might be the correct strategy if you expect pure scaling to get you to a sufficiently competent research agent, which would then let you very rapidly search a much wider space of possibilities. If you have the choice between digging a ditch by hand or building a backhoe to dig for you…
Another critical question is whether there are radical improvements potentially discoverable by future LLM research agents. I believe there are; laying out my arguments for that is a longer discussion.
Some sources which I think give hints about the thinking and focus of big lab researchers:
https://www.youtube.com/watch?v=UTuuTTnjxMQ
https://braininspired.co/podcast/193/
Some sources on ideas which go beyond the nearby idea-space of transformers:
https://www.youtube.com/watch?v=YLiXgPhb8cQ
https://arxiv.org/abs/2408.10205
There should probably be a dialogue between you and @Vladimir_Nesov over how much algorithmic improvements actually contribute to making AI more powerful, since this might reveal cruxes and help everyone else prepare better for the various AI scenarios.
For what it’s worth, it seems to me that Jack Clark of Anthropic is mostly in agreement with @Vladimir_Nesov about compute being the primary factor. Quoting from Jack’s blog here:
The world’s most capable open weight model is now made in China: …Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class… The world’s best open weight model might now be Chinese—that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated).
Why this matters—competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute—on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to equivalent compute as Western counterparts then this wouldn’t just be a world class open weight model—it might be competitive with the far more experienced proprietary models made by Anthropic, OpenAI, and so on. Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).
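A quick back-of-envelope check of Clark’s compute placement, using the common C ≈ 6·N·D approximation for dense training FLOPs (N = active parameters, D = training tokens). The ~7T-token figure is taken from the Tencent paper; treat both inputs as approximate:

```python
# Rough training-compute estimate via the standard C ≈ 6 * N * D rule of thumb.
# For a MoE model, N is the ACTIVATED parameter count, not total parameters.
n_active_params = 52e9   # Hunyuan-Large activates 52B of its 389B parameters
n_tokens = 7e12          # ~7T pretraining tokens, per the Tencent paper

flops = 6 * n_active_params * n_tokens
print(f"{flops:.2e}")  # ≈ 2.18e+24 — between 10^24 and 10^25, matching Clark's read
```

This lands squarely in the 10^24–10^25 range Clark cites, an order of magnitude or more below the 10^25–10^26 range he attributes to Western frontier models.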