But it’s not totally clear. In my experience, using a suboptimal learning rate sometimes seems to put the model on the wrong kind of trajectory, i.e. you can’t necessarily switch to the “correct” learning rate partway through and still get the same performance as if you’d used the correct schedule from the beginning.
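To make “wrong kind of trajectory” concrete, here’s a toy sketch of mine (the numbers are made up, not from any of the papers discussed here): a warmup-plus-cosine schedule whose decay horizon is sized for a much longer run keeps the learning rate too high early on, so even if you switch to the correctly sized schedule partway through, the optimizer has already taken different steps than it would have under the right schedule from the start.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup=100, min_lr_frac=0.1):
    """Warmup followed by cosine decay, with the cosine cycle sized to total_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)

# Run A: schedule sized to the actual training length (10k steps).
# Run B: first half on a schedule sized for 50k steps (decaying too slowly),
# then switched to the "correct" 10k-step schedule at step 5000.
for step in (0, 1000, 2500, 4000, 5000, 7500, 10000):
    correct = lr_at_step(step, total_steps=10_000)
    switched = (lr_at_step(step, total_steps=50_000) if step < 5000
                else lr_at_step(step, total_steps=10_000))
    print(f"step {step:>6}: correct {correct:.2e}   switched-late {switched:.2e}")
```

The learning rates already differ at every early step, which is the sense in which the two runs are on different trajectories before the switch ever happens.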
But I don’t really understand this from the abstract alone. I thought the Kaplan scaling laws were based on single-epoch training, with at most minimal upsampling of some parts of the training data? How do you then get suboptimal scaling laws based on not using enough data?
It was single-epoch only in the sense that they didn’t even complete one pass over all their data: they only trained on a subsample of their full Internet text dataset (you can see the ratios in the papers somewhere). But even if they had trained exactly once on every token, with none of the oversampling/undersampling business, there’s no reason to expect that one fixed dataset to be exactly the right size for every possible model size, whatever the scaling turns out to be. Turns out, that fixed amount was much too small for the smaller models, and maybe too large for the largest models. (Although even with the Kaplan law, people were undertraining models and getting half-baked results; look at Megatron-Turing NLG.)
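For a sense of scale, here’s a back-of-the-envelope of my own, using the rough Chinchilla-style heuristic of ~20 training tokens per parameter (an approximation for illustration, not a number from this thread or the papers): the compute-optimal token budget moves by orders of magnitude across model sizes, so no single fixed dataset size can sit at the optimum for all of them.

```python
# Rough Chinchilla-style heuristic: ~20 training tokens per parameter.
# This is an approximation for illustration, not an exact law.
TOKENS_PER_PARAM = 20

for params in (100e6, 1e9, 10e9, 100e9, 500e9):
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>6.1f}B params -> roughly {optimal_tokens / 1e9:>8.0f}B "
          f"tokens to train compute-optimally")
```

Any one fixed training set is therefore off by a large factor, in one direction or the other, for most of the model sizes in a scaling sweep.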
I would also say “probably”.
Must have been different, I suppose.
Don’t you mean the dataset size was much too large for the smaller models and maybe too small for the largest models?