But it’s not totally clear. In my experience, using a suboptimal learning rate sometimes seems to put the model on the wrong kind of trajectory, i.e. you can’t necessarily switch to the “correct” learning rate partway through and still get the same performance as if you’d used the correct schedule from the beginning.
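To make “wrong kind of trajectory” concrete, here’s a toy sketch of mine (the numbers are made up, not from any of the papers discussed here): a warmup-plus-cosine schedule whose decay horizon is sized for a much longer run keeps the learning rate too high early on, so even if you switch to the correctly sized schedule partway through, the optimizer has already taken different steps than it would have under the right schedule from the start.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup=100, min_lr_frac=0.1):
    """Warmup followed by cosine decay, with the cosine cycle sized to total_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)

# Run A: schedule sized to the actual training length (10k steps).
# Run B: first half on a schedule sized for 50k steps (decaying too slowly),
# then switched to the "correct" 10k-step schedule at step 5000.
for step in (0, 1000, 2500, 4000, 5000, 7500, 10000):
    correct = lr_at_step(step, total_steps=10_000)
    switched = (lr_at_step(step, total_steps=50_000) if step < 5000
                else lr_at_step(step, total_steps=10_000))
    print(f"step {step:>6}: correct {correct:.2e}   switched-late {switched:.2e}")
```

The learning rates already differ at every early step, which is the sense in which the two runs are on different trajectories before the switch ever happens.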
But I don’t really understand this from the abstract alone. I thought the Kaplan scaling laws were based on single-epoch training, with at most minimal upsampling of some parts of the training data? How do you then get suboptimal scaling laws based on not using enough data?
It was single-epoch only in the sense that they didn’t even complete one pass over all their data: they only trained on a subsample of their full Internet text dataset (you can see the ratios in the papers somewhere). But even if they had trained exactly once on every token, with none of the oversampling/undersampling business, there’s no reason to expect that one fixed dataset to be exactly the right size for every possible model size, whatever the scaling turns out to be. Turns out, that fixed amount was much too small for the smaller models, and maybe too large for the largest models. (Although even with the Kaplan law, people were undertraining models and getting half-baked results; look at Megatron-Turing NLG.)
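For a sense of scale, here’s a back-of-the-envelope of my own, using the rough Chinchilla-style heuristic of ~20 training tokens per parameter (an approximation for illustration, not a number from this thread or the papers): the compute-optimal token budget moves by orders of magnitude across model sizes, so no single fixed dataset size can sit at the optimum for all of them.

```python
# Rough Chinchilla-style heuristic: ~20 training tokens per parameter.
# This is an approximation for illustration, not an exact law.
TOKENS_PER_PARAM = 20

for params in (100e6, 1e9, 10e9, 100e9, 500e9):
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>6.1f}B params -> roughly {optimal_tokens / 1e9:>8.0f}B "
          f"tokens to train compute-optimally")
```

Any one fixed training set is therefore off by a large factor, in one direction or the other, for most of the model sizes in a scaling sweep.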
I would also say “probably”.
Must have been different, I suppose.
Don’t you mean the dataset size was much too large for the smaller models and maybe too small for the largest models?