only if the model has more parameters than the dataset has tokens, and is trained for >10 epochs, does overfitting kick in and scaling break down.
That sounds surprising. You are claiming that you observe the exact same loss, and the same downstream benchmark performance, whether you train a model on a dataset for 10 epochs or train on 10x more data for 1 epoch?
I would have expected some substantial degradation in efficiency such that the 10-epoch case was equivalent to training on 5x the data or something.
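To pin down what is being claimed, here is a minimal sketch of the controlled comparison (PyTorch on a synthetic toy corpus; every size, model, and hyperparameter here is an illustrative assumption, not anyone's actual setup): hold the total training-token budget fixed, then train either 10 epochs on 1/10 of the data or 1 epoch on all of it, and compare held-out loss. The surprising claim is that the two held-out losses (and downstream scores) come out essentially identical.

```python
# Minimal sketch: repeated data (10 epochs on 1/10 of the corpus) vs. unique
# data (1 epoch on the full corpus), at an identical total training-token budget.
# All sizes, the toy model, and the synthetic data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ_LEN, N_SEQS = 64, 32, 10_000

# Synthetic "corpus": sequences drawn from a fixed, biased unigram distribution.
probs = torch.softmax(torch.randn(VOCAB), dim=0)
corpus = torch.multinomial(probs, N_SEQS * SEQ_LEN, replacement=True).view(N_SEQS, SEQ_LEN)
heldout = torch.multinomial(probs, 1_000 * SEQ_LEN, replacement=True).view(1_000, SEQ_LEN)

class TinyLM(nn.Module):
    """Toy next-token model: embed token t, predict token t+1."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)
    def forward(self, x):
        return self.head(self.emb(x[:, :-1]))

def lm_loss(model, batch):
    logits = model(batch)  # (B, SEQ_LEN-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))

def train_and_eval(data, epochs, batch_size=100):
    model = TinyLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        perm = torch.randperm(len(data))
        for i in range(0, len(data), batch_size):
            loss = lm_loss(model, data[perm[i:i + batch_size]])
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return lm_loss(model, heldout).item()

# Same total token budget either way: 10 epochs x 1,000 seqs == 1 epoch x 10,000 seqs.
print("10 epochs on 1/10 of the data:", train_and_eval(corpus[:1_000], epochs=10))
print("1 epoch on all of the data:   ", train_and_eval(corpus, epochs=1))
```

On a toy i.i.d. corpus like this, the two runs will of course look similar; the interesting question is whether the equivalence also holds at LM scale on real text, which is what the claim asserts.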
Twitter points me to an instance of this with T5, Figure 6/Table 9: at the lowest level of repetition tested, 64 repeats, there is slight downstream benchmark harm, but still a lot less than I would’ve guessed.
Not sure how strongly to take this: those benchmarks are weak, not very comprehensive, and wouldn’t turn up harm to interesting capabilities like few-shot learning or emergent ones like inner-monologue; but on the other hand, T5 is also a pretty strong model-family, was SOTA in several ways at the time & is still regularly used in cutting-edge work, so it’s notable that it’s harmed so little.