It also seems likely that the Nano models are extremely overtrained relative to the scaling laws. Those laws describe compute-optimal training, but here the goal is to minimize inference cost, so it would make sense to train for significantly longer.
Agreed (well, except for a nitpick that post-Chinchilla versions of scaling laws also make predictions for scaling data and parameter count separately, including in the overtrained regime): overtraining during distillation seems like the obvious approach. Ideally you'd use a lot of data rather than many epochs, in order to minimize memorization; much of that data could be synthetic, which would also help avoid issues like memorization of PII and copyrighted text. Using distillation also effectively increases the size of your training set for scaling-law purposes, since the trainee model gets more signal per example: not just the tokens of the correct answer, but the larger trainer model's logits for those tokens and for its top alternative tokens. So each document in the distillation training set becomes worth several times as much.
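To make the "more signal per example" point concrete, here is a minimal sketch of a standard soft-label distillation loss, where the student is trained against the teacher's full (temperature-softened) token distribution in addition to the hard correct-token label. This is the generic knowledge-distillation recipe, not anything specific to the Nano models; the function names and the `alpha`/`temperature` values are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx,
                 alpha=0.5, temperature=2.0):
    """Blend hard-label cross-entropy with soft-label KL to the teacher.

    The KL term is where the extra per-token signal comes from: the
    student matches the teacher's probabilities over *all* candidate
    tokens, not just the one correct token.
    """
    # Temperature-softened distributions for teacher and student.
    t_probs = softmax([z / temperature for z in teacher_logits])
    s_probs = softmax([z / temperature for z in student_logits])
    # KL(teacher || student) over the softened distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(t_probs, s_probs) if p > 0)
    # Ordinary cross-entropy against the hard (correct-token) label.
    ce = -math.log(softmax(student_logits)[target_idx])
    # T^2 rescaling keeps the soft-target gradients comparable in scale.
    return alpha * ce + (1 - alpha) * (temperature ** 2) * kl
```

When the student already matches the teacher exactly, the KL term vanishes and only the hard-label cross-entropy remains, which is the sense in which plain next-token training is a special (information-poorer) case of this loss.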