only if the model has more parameters than the dataset has tokens, and is trained for >10 epochs, does overfitting kick in and scaling break down.
That sounds surprising. You are claiming that you observe the exact same loss, and the same downstream benchmark performance, whether you train a model on a dataset for 10 epochs or train on 10x more data for 1 epoch?
I would have expected some substantial degradation in efficiency such that the 10-epoch case was equivalent to training on 5x the data or something.
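To pin down what is being claimed, here is a minimal sketch of the controlled comparison (PyTorch on a synthetic toy corpus; every size, model, and hyperparameter here is an illustrative assumption, not anyone's actual setup): hold the total training-token budget fixed, then train either 10 epochs on 1/10 of the data or 1 epoch on all of it, and compare held-out loss. The surprising claim is that the two held-out losses (and downstream scores) come out essentially identical.

```python
# Minimal sketch: repeated data (10 epochs on 1/10 of the corpus) vs. unique
# data (1 epoch on the full corpus), at an identical total training-token budget.
# All sizes, the toy model, and the synthetic data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ_LEN, N_SEQS = 64, 32, 10_000

# Synthetic "corpus": sequences drawn from a fixed, biased unigram distribution.
probs = torch.softmax(torch.randn(VOCAB), dim=0)
corpus = torch.multinomial(probs, N_SEQS * SEQ_LEN, replacement=True).view(N_SEQS, SEQ_LEN)
heldout = torch.multinomial(probs, 1_000 * SEQ_LEN, replacement=True).view(1_000, SEQ_LEN)

class TinyLM(nn.Module):
    """Toy next-token model: embed token t, predict token t+1."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)
    def forward(self, x):
        return self.head(self.emb(x[:, :-1]))

def lm_loss(model, batch):
    logits = model(batch)  # (B, SEQ_LEN-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))

def train_and_eval(data, epochs, batch_size=100):
    model = TinyLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        perm = torch.randperm(len(data))
        for i in range(0, len(data), batch_size):
            loss = lm_loss(model, data[perm[i:i + batch_size]])
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return lm_loss(model, heldout).item()

# Same total token budget either way: 10 epochs x 1,000 seqs == 1 epoch x 10,000 seqs.
print("10 epochs on 1/10 of the data:", train_and_eval(corpus[:1_000], epochs=10))
print("1 epoch on all of the data:   ", train_and_eval(corpus, epochs=1))
```

On a toy i.i.d. corpus like this, the two runs will of course look similar; the interesting question is whether the equivalence also holds at LM scale on real text, which is what the claim asserts.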
Twitter points me to an instance of this with T5, Figure 6/Table 9: at the lowest level of repetition tested, 64 repeats, there is slight downstream benchmark harm, but still a lot less than I would’ve guessed.
Not sure how strongly to take this: those benchmarks are weak, not very comprehensive, and wouldn’t turn up harm to interesting capabilities like few-shot learning or emergent ones like inner-monologue; but on the other hand, T5 is also a pretty strong model-family, was SOTA in several ways at the time & is still regularly used in cutting-edge work, so it’s notable that it’s harmed so little.