The release of Llama 405b was the thing that most succinctly explained this to me. At least when it comes to the current generation of cutting-edge LLMs, there is no secret sauce. Llama 405b is a cutting-edge model with, as far as I can tell, no advances in architecture or training relative to GPT-3. Indeed, its architecture appears substantially simpler than GPT-4’s, yet the model outperforms GPT-4, suggesting that in the long run, simplicity of architecture tends to win out, especially if you are willing to take a relatively small (<3x) compute-cost hit.
The architecture is a straightforward transformer with no mixture of experts or anything fancy.
The training process did nothing interesting. It used the most obvious implementation of supervised fine-tuning and reinforcement training.
The data cleaning process was somewhat more involved, and we know less about it, but I think it is unlikely to have involved anything like synthetic data generation or complicated AI-assisted review.
This might all again change with the next generation of LLMs (especially with things like Strawberry, which looks like it might do something more interesting), but at least right now, I think almost any competent engineering team in the world could build a cutting-edge AI model, if they were just willing to spend the compute. It requires overcoming some minor engineering challenges, but the basics of how to do this are figured out. There is no moat.
@ryan_greenblatt: Curious if you have a quick example of an architectural change from GPT-3. Quick googling/perplexing maybe suggests some changes in the attention algorithm (grouped-query attention instead of whatever GPT-3 was doing).
I was trying to just highlight “training” rather than architecture. I think there are architecture changes (SwiGLU, grouped-query attention, probably somewhat better-tuned transformer hparams like layer count, etc.), though these are perhaps minor.
My understanding of the key training advances relative to GPT-3:
Closer to Chinchilla optimal via having enough data. (I think 405b is 2x too much data according to Chinchilla while GPT-3 is 8x too little data.)
Better data. The paper says “Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training.”
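As a rough check on the tokens-per-parameter comparison above, here is a minimal sketch. The parameter and token counts are assumptions taken from the public model reports (roughly 15.6T training tokens for Llama 405b and roughly 300B for GPT-3 at 175B parameters), and 20 tokens per parameter is the common rule-of-thumb reading of the Chinchilla paper rather than an exact constant.

```python
# Rough tokens-per-parameter check. All figures are approximate and
# taken from public model reports; treat them as assumptions.
CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter

MODELS = {
    # name: (parameters, training tokens)
    "Llama-3-405b": (405e9, 15.6e12),
    "GPT-3": (175e9, 300e9),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def data_vs_chinchilla(params, tokens, ratio=CHINCHILLA_RATIO):
    """Above 1 means more data than the rule of thumb, below 1 means less."""
    return tokens_per_param(params, tokens) / ratio


for name, (n, d) in MODELS.items():
    print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
          f"{data_vs_chinchilla(n, d):.2f}x the rule of thumb")
```

Under these assumptions 405b lands at roughly 2x the rule of thumb, while GPT-3 lands around an order of magnitude below it. The exact shortfall for GPT-3 depends on which Chinchilla fit and token count you assume.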
“I think 405b is 2x too much data according to chinchilla while GPT3 is 8x too little data.”
They did the Chinchilla scaling experiments themselves; it’s in the report (Section 3.2.1, Scaling Laws). The result claims that 40 tokens/parameter is actually optimal in their setup (2x more than in the Chinchilla paper), so Llama-3-405b is Chinchilla optimal in the relevant sense; it’s not trained on too much data. The result is slightly suspicious in that their largest datapoints are 1e22 FLOPs, while Llama-3-405b itself is 4e25 FLOPs, so that’s a lot of extrapolation. But overall they find that the optimal tokens/parameter ratio increases with compute, more so than in the Chinchilla paper, and Llama-3-405b had more compute than Chinchilla.
“Another interesting finding is the optimal number of tokens per parameter. We found this optimal number to be slightly increasing across our range of experiments (see the dashed black line). Note that our methodology differed from that of Chinchilla in a few significant ways: we explicitly scaled the number of machines together with the model size, effectively changing the batch size.”
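To get a feel for how much extrapolation is involved, here is a quick estimate using the standard C ≈ 6·N·D approximation for dense-transformer training compute. The 15.6T token count is an assumption from the public report; the 1e22 figure is the largest scaling-law datapoint mentioned above.

```python
def training_flops(params, tokens):
    """Approximate dense-transformer training compute: C ~ 6 * N * D."""
    return 6.0 * params * tokens


# Assumed figures: 405B parameters, about 15.6T training tokens.
llama_flops = training_flops(405e9, 15.6e12)  # about 3.8e25

# Largest scaling-law datapoint cited in the discussion above.
largest_fit_point = 1e22

# How far the final run sits beyond the fitted range.
extrapolation = llama_flops / largest_fit_point

print(f"estimated training compute: {llama_flops:.2e} FLOPs")
print(f"extrapolation beyond fitted range: {extrapolation:.0f}x")
```

On these numbers the estimate agrees with the roughly 4e25 FLOPs quoted above, and the final model sits several thousand times beyond the largest point used to fit the scaling law.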
Ah, sorry, yeah, I basically agree with this. I do think the scaling law stuff made a big difference. I commented a bit on the training data stuff, but my best guess is the changes there are also minor (besides the sheer volume).
Keep in mind that if, hypothetically, there were major compute efficiency tricks to be had, they would likely not be shared publicly. So the absence of publicly known techniques is not strong evidence in either direction.
Also, in general I start from a prior of being skeptical of papers claiming their models are comparable to or better than GPT-4. It’s very easy to mislead with statistics; for example, human preference comparisons depend very heavily on the task distribution and on how discerning the raters are. I have not specifically looked deeply into Llama 405B though.
That’s true, though I do think there are various proxies that make at least the extreme end of this kind of thing for currently deployed models relatively easy to rule out (like the compute-purchase and allocation decisions of major cloud providers who host some of these models, and staff allocation and various other things).
I do think most organizations who claim parity with GPT-4 or Sonnet are almost always overstating things. My experience with 405b suggests it is also not at the level of Claude 3.5 Sonnet, but it does seem to be at the level of the original GPT-4, though I am not confident since I haven’t played around that much with GPT-4 recently.
Yeah, I mostly agree. I would say that there may or may not be certain secret techniques which will give models a slightly lower loss plateau for a given parameter count. That matters more to the large companies than compute efficiency, I think.
Accumulate enough loss-plateau-lowering tidbits, and it could add up to having the best model out of a group of similarly sized models.
Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).
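The finding that the optimal tokens/parameter ratio increases with compute is also consistent with the CARBS experiments done by Imbue (search for “tokens per parameter”), who report the optimal ratio to be slightly increasing across their range of experiments.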