You’re right, the idea that multiple epochs can’t possibly help is one of the weakest links in the post. Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first—I’m kinda confused by this too.
After thinking more about it, I agree that the repeated-data papers don’t provide much evidence that multiple epochs are harmful.
For example, the Anthropic repeated-data paper does consider cases where a non-small fraction of the total training tokens are repeated more than once. In their most extreme case:
- half of the training tokens are never repeated during training, and
- the other half of the training tokens are some (smaller) portion of the original dataset, repeated 2 or more times.
But this effectively lowers the total size of the model’s training dataset—the number of training tokens is held constant (100B), so the repeated copies are taking up space that would otherwise be used for fresh data. For example, if the repeated tokens are repeated 2 times, then we are only using 3⁄4 of the data we could be (we select 1⁄2 for the unrepeated part, and then select 1⁄4 and repeat it twice for the other part).
We’d expect this to hurt the model, and to hurt larger models more, which explains some fraction of the observed effect.
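To make that arithmetic concrete, here's a quick back-of-the-envelope sketch (a toy Python calculation, not anything from the paper; `unique_fraction`, `fresh_share`, and `repeat_factor` are just names I'm making up for illustration):

```python
# Effective unique data when a fixed token budget is split between a "fresh"
# half (each token seen once) and a smaller slice of the corpus repeated
# r times, with the total number of training tokens held constant.

def unique_fraction(repeat_factor: int, fresh_share: float = 0.5) -> float:
    """Fraction of the available dataset that is actually seen at least once."""
    repeated_share = 1.0 - fresh_share
    # The repeated half only covers (repeated_share / repeat_factor) of the corpus.
    return fresh_share + repeated_share / repeat_factor

for r in [1, 2, 4, 10, 100]:
    print(f"repeat factor {r:>3}: using {unique_fraction(r):.3f} of the data we could be")
```

With a repeat factor of 2 this reproduces the 3⁄4 figure above, and it drops toward 1⁄2 as the repeat factor grows.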
I think there’s a much stronger case that multiple epochs are surprisingly unhelpful for large models, even if they aren’t harmful. I went over that case in this post. (That post was based on the earlier Kaplan et al. papers, but I think the basic result still holds.)
However, multiple epochs do help, just less so as N grows… so even if they are negligibly helpful at GPT-3 size or above, they still might be relevantly helpful at Chinchilla size or below. (And this would then push the compute-optimal N even further down relative to Chinchilla, preferring smaller models + more steps.)
It would be really nice to see an extension of the Chinchilla experiment that tried multiple epochs, which would directly answer the question.
I’m not sure what I’d expect the result to be, even directionally. Consider that if you are setting your learning rate schedule length to the full length of training (as in Chinchilla), then “doing a 2-epoch run” is not identical to “doing a 1-epoch run, then doing another epoch.” You’ll have a higher LR during the first epoch than the 1-epoch run would have had, which would have been suboptimal if you had stopped at the first epoch.
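To illustrate what I mean, here's a minimal sketch assuming a plain cosine decay to zero over the full run (not claiming this is Chinchilla's exact schedule; `cosine_lr`, the peak LR, and the step counts are made-up illustrative values):

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4) -> float:
    """Cosine decay from peak_lr down to 0 over total_steps."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

steps_per_epoch = 10_000

# Learning rate at the end of the first epoch under each plan:
one_epoch_run = cosine_lr(steps_per_epoch, total_steps=steps_per_epoch)      # 0.0: fully decayed
two_epoch_run = cosine_lr(steps_per_epoch, total_steps=2 * steps_per_epoch)  # half the peak LR

print(one_epoch_run, two_epoch_run)
```

So the model you pass through at the 1-epoch mark of the 2-epoch run is not the model the 1-epoch run would have ended with.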
Thanks, that’s interesting… the odd thing about using a single epoch, or even two epochs, is that you’re treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of prior training and learning rate at the moment the model sees it. The very first step, starting from random weights, presumably can’t extract high-level knowledge very well because the model is still trying to learn low-level trends like word frequency. So if the first batch has valuable high-level patterns and you never revisit it, you’re effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn’t too bad, though.
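A toy way to see the asymmetry I mean (purely illustrative, nothing measured): compare how late in training each example is last visited under one pass versus two reshuffled passes.

```python
import random

random.seed(0)
n_examples = 10_000

# Single pass: each example is seen exactly once, at its position in one shuffled order.
single_pass = random.sample(range(n_examples), n_examples)
last_seen_1ep = {ex: pos / n_examples for pos, ex in enumerate(single_pass)}

# Two passes with independent shuffles: every example is revisited in the second
# half of training, after the model has already learned low-level statistics.
second_pass = random.sample(range(n_examples), n_examples)
last_seen_2ep = {ex: (n_examples + pos) / (2 * n_examples) for pos, ex in enumerate(second_pass)}

earliest_examples = single_pass[:100]  # the very first "batch" of the single pass
print(min(last_seen_1ep[ex] for ex in earliest_examples))  # ~0.00: only ever seen by near-random weights
print(min(last_seen_2ep[ex] for ex in earliest_examples))  # >= 0.5: revisited late, at a lower LR
```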