An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training then it has been trained for one “epoch”.
In my experience scaling continues for quite a few epochs over the same dataset; only if the model has more parameters than the dataset has tokens and training runs for >10 epochs does overfitting kick in and scaling break down.
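To make the distinction concrete, here is a minimal sketch; the corpus size and token budget are hypothetical numbers picked only for illustration:

```python
# Minimal sketch of the distinction above. The corpus size and token budget
# are made-up illustrative numbers, not figures from this thread.
dataset_tokens = 300e9    # unique tokens in the dataset (hypothetical)
training_tokens = 900e9   # total tokens the model is trained for (hypothetical)

epochs = training_tokens / dataset_tokens
print(f"epochs = {epochs:.1f}")  # -> epochs = 3.0, i.e. each token is seen ~3 times
```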
This distinction exists in general, but it’s irrelevant when training sufficiently large LMs.
It is well-established that repeating data during large LM training is not a good practice. Depending on the model size and the amount of repeating, one finds that it is either
1. a suboptimal use of compute (relative to training a bigger model for 1 epoch), or
2. actively harmful, as measured by test loss or loss on out-of-distribution data
with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and surrounding discussion).
For more, see
references linked in footnote 11 of this post, on how repeating data can be harmful
my earlier post here, on how repeating data can be compute-inefficient even when it’s not harmful
this report on my own experience finetuning a 6.1B model, where >1 epoch was harmful
I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at DeepMind and OpenAI.
I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, and then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe does not seem massive (when we are talking about scaling by a couple of OOMs), and they keep the number of training tokens constant.
I really can’t tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.
So do you think, once we get to the point where essentially all new language models are trained on essentially all existing language data, it will always be more compute-efficient to increase the size of the model rather than train for a second epoch?
This would seem very unintuitive, and it is not directly addressed by the papers you linked in footnote 11, which deal with small portions of the dataset being repeated.
You’re right, the idea that multiple epochs can’t possibly help is one of the weakest links in the post. Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first—I’m kinda confused by this too.
After thinking more about it, I agree that the repeated-data papers don’t provide much evidence that multiple epochs are harmful.
For example, the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once. In their most extreme case,
half of the training tokens are never repeated during training, and
the other half of training tokens are some (smaller) portion of the original dataset, repeated 2 or more times
But this effectively lowers the total size of the model’s training dataset—the number of training tokens is held constant (100B), so the repeated copies are taking up space that would otherwise be used for fresh data. For example, if the repeated tokens are repeated 2 times, then we are only using 3⁄4 of the data we could be (we select 1⁄2 for the unrepeated part, and then select 1⁄4 and repeat it twice for the other part).
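Here is that arithmetic as a small sketch; the 100B total is the figure mentioned above, and the rest just restates the fractions:

```python
# Sketch of the arithmetic above for the most extreme repeated-data case.
total_training_tokens = 100e9                 # held constant in the setup described above

fresh_unique = total_training_tokens / 2      # half the tokens are never repeated: 50B unique
repeats = 2                                   # the other half is one chunk of data seen twice
repeated_unique = (total_training_tokens / 2) / repeats   # that chunk is only 25B unique tokens

unique_data_used = fresh_unique + repeated_unique          # 75B unique tokens in total
print(unique_data_used / total_training_tokens)            # 0.75, i.e. 3/4 of the data we could be using
```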
We’d expect this to hurt the model, and to hurt larger models more, which explains some fraction of the observed effect.
I think there’s a much stronger case that multiple epochs are surprisingly unhelpful for large models, even if they aren’t harmful. I went over that case in this post. (Which was based on the earlier Kaplan et al. papers, but I think the basic result still holds.)
However, multiple epochs do help, just less so as N grows… so even if they are negligibly helpful at GPT-3 size or above, they still might be relevantly helpful at Chinchilla size or below. (And this would then push the compute-optimal N even further down relative to Chinchilla, preferring smaller models + more steps.)
It would be really nice to see an extension of the Chinchilla experiment that tried multiple epochs, which would directly answer the question.
I’m not sure what I’d expect the result to be, even directionally. Consider that if you are setting your learning rate schedule length to the full length of training (as in Chinchilla), then “doing a 2-epoch run” is not identical to “doing a 1-epoch run, then doing another epoch.” You’ll have a higher LR during the first epoch than the 1-epoch run would have had, which would have been suboptimal if you had stopped at the first epoch.
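As a rough illustration of that point, here is a sketch with a generic cosine schedule; the schedule shape, peak and minimum learning rates, and step counts are assumptions for illustration, not the actual Chinchilla settings:

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Generic cosine decay from lr_max down to lr_min over total_steps (illustrative values)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

steps_per_epoch = 10_000           # hypothetical
step = steps_per_epoch // 2        # halfway through the first epoch

# Same step, same data so far, but the schedule is stretched over the planned run length:
lr_planning_1_epoch = cosine_lr(step, total_steps=steps_per_epoch)        # ~1.7e-4
lr_planning_2_epochs = cosine_lr(step, total_steps=2 * steps_per_epoch)   # ~2.6e-4

# The 2-epoch run is still training "hotter" at this point, so its first epoch is not
# equivalent to the 1-epoch run, even though both have seen identical data.
print(lr_planning_1_epoch, lr_planning_2_epochs)
```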
Thanks, that’s interesting… the odd thing about using a single epoch, or even two epochs, is that you’re treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of prior training and learning rate. The very first step, starting from random weights, presumably can’t extract high-level knowledge very well, because the model is still trying to learn low-level trends like word frequency. So if the first batch has valuable high-level patterns and you never revisit it, you’re effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn’t too bad, though.
“only if the model has more parameters than the dataset has tokens and training runs for >10 epochs does overfitting kick in and scaling break down”
That sounds surprising. You are claiming that you observe the exact same loss, and downstream benchmarks, if you train a model on a dataset for 10 epochs as you do when training on 10x more data for 1 epoch?
I would have expected some substantial degradation in efficiency such that the 10-epoch case was equivalent to training on 5x the data or something.
Twitter points me to an instance of this with T5, Figure 6/Table 9: at the lowest tested level of 64 repeats, there is slight downstream benchmark harm but still a lot less than I would’ve guessed.
Not sure how strongly to take this: those benchmarks are weak, not very comprehensive, and wouldn’t turn up harm to interesting capabilities like few-shot learning or emergent ones like inner-monologue; but on the other hand, T5 is also a pretty strong model family, was SOTA in several ways at the time, and the family is still regularly used in cutting-edge work, so it’s notable that it’s harmed so little.
This paper is very unrepresentative: it seems to test 1 vs 64-1,000,000 repeats of data, not 1 vs 2-10 repeats as you would use in practice.
I can’t access the wandb link; maybe you have to change the access rules.
I was interested in the report on finetuning a model for more than 1 epoch, even though finetuning is obviously not the same as training.
It should work now, sorry about that.