I was trying to just highlight “training” rather than architecture. I think there are architecture changes (SwiGLU, grouped-query attention, probably somewhat better-tuned transformer hparams like layer count, etc.), though these are perhaps minor.
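(For reference, a minimal sketch of the SwiGLU feed-forward block used in Llama-style models in place of GPT-3’s GELU MLP; the module name, the 8/3 hidden-width factor, and the bias-free linears are common conventions assumed here, not taken from the report.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU gating: out = W2(SiLU(W1 x) * W3 x).

    Llama-style models use this in place of GPT-3's GELU MLP. The 8/3
    hidden-width factor and bias-free linears are common conventions,
    assumed here rather than taken from the Llama 3 report.
    """

    def __init__(self, d_model: int, hidden_mult: float = 8 / 3):
        super().__init__()
        d_hidden = int(d_model * hidden_mult)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value ("up") projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output ("down") projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

The gated form replaces the single up-projection plus nonlinearity of the GPT-3-style MLP with two projections whose elementwise product feeds the down-projection.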
My understanding of the key training advances relative to GPT-3:
Closer to Chinchilla optimal via having enough data. (I think 405B is trained on 2x too much data according to Chinchilla, while GPT-3 was trained on 8x too little; rough arithmetic below.)
Better data. The paper says “Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training.”
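(A rough back-of-the-envelope version of that arithmetic, as a sketch: the ~20 tokens/parameter rule of thumb is the usual reading of the Chinchilla paper, and the ~15T / ~300B training-token counts for Llama-3-405B and GPT-3 are assumed headline figures rather than numbers from this thread.)

```python
# Rough arithmetic behind the "2x too much" / "8x too little" claims above.
# Assumed numbers (not from this thread): ~20 tokens/parameter as the usual
# Chinchilla rule of thumb, ~15T pre-training tokens for Llama-3-405B,
# ~300B for GPT-3.

CHINCHILLA_TOKENS_PER_PARAM = 20

models = {
    # name:          (parameters, training tokens)
    "Llama-3-405B": (405e9, 15e12),
    "GPT-3":        (175e9, 300e9),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    vs_rule = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param, {vs_rule:.2f}x the ~20 rule of thumb")

# Prints roughly:
#   Llama-3-405B: 37 tokens/param, 1.85x the ~20 rule of thumb
#   GPT-3: 2 tokens/param, 0.09x the ~20 rule of thumb
# i.e. ~2x over the rule of thumb for Llama-3-405B and roughly an order of
# magnitude under it for GPT-3; the exact multiples depend on which token
# counts and which Chinchilla fit you assume.
```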
I think 405B is trained on 2x too much data according to Chinchilla, while GPT-3 was trained on 8x too little
They did the Chinchilla scaling experiments themselves; it’s in the report (Section 3.2.1, Scaling Laws). The result claims that 40 tokens/parameter is actually optimal in their setup (about 2x the Chinchilla paper’s ~20), so Llama-3-405B is Chinchilla optimal in the relevant sense: it’s not trained on too much data. The result is slightly suspicious in that their largest data points are 1e22 FLOPs, while Llama-3-405B itself is ~4e25 FLOPs, so that’s a lot of extrapolation. But overall they find that the optimal tokens/parameter ratio increases with compute, more so than in the Chinchilla paper, and Llama-3-405B had more compute than Chinchilla.
This is also consistent with the CARBS experiments done by Imbue (search for “tokens per parameter”):
Another interesting finding is the optimal number of tokens per parameter. We found this optimal number to be slightly increasing across our range of experiments (see the dashed black line). Note that our methodology differed from that of Chinchilla in a few significant ways: we explicitly scaled the number of machines together with the model size, effectively changing the batch size.
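(As a quick check on the amount of extrapolation described above, a sketch using the standard C ≈ 6·N·D training-FLOPs approximation; the ~15T pre-training-token count is an assumption, not something stated in this thread.)

```python
# Quick check of the compute figures in the scaling-law comment above, using
# the standard C ~ 6 * N * D training-FLOPs approximation.
# Assumed (not stated in this thread): D ~ 15e12 pre-training tokens.

N = 405e9               # Llama-3-405B parameters
D = 15e12               # approximate pre-training tokens (assumption)
largest_isoflop = 1e22  # largest compute scale in their scaling-law experiments

llama_flops = 6 * N * D
print(f"Llama-3-405B training compute ~ {llama_flops:.1e} FLOPs")
# -> ~3.6e+25, consistent with the ~4e25 figure above

print(f"extrapolation factor ~ {llama_flops / largest_isoflop:.1e}x")
# -> ~3.6e+03x, i.e. over three orders of magnitude beyond the fitted points
```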
Ah, sorry, yeah, I basically agree with this. I do think the scaling law stuff made a big difference. I commented a bit on the training data stuff, but my best guess is the changes there are also minor (besides the sheer volume).