It looks like you’re experimenting with the 5 smallest models but haven’t done any analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning to add, or no?
We have done some preliminary analyses on these as well. The primary issue is just that these experiments take longer, since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). I am planning to run the same experiments on the larger models and update the post with them at some point, however.
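For anyone wanting to reproduce the checkpoint sweep, here is a minimal sketch of what the loop looks like, assuming the Pythia checkpoints are exposed as `step{N}` revision branches on the HuggingFace Hub (the model name and step values below are illustrative, not the full set used in the post):

```python
# Minimal sketch: iterate over Pythia checkpoints and collect basic weight
# statistics. Assumes checkpoints are available as "step{N}" revision branches;
# the step list here is an illustrative subset, not all 142 checkpoints.
import torch
from transformers import GPTNeoXForCausalLM

steps = [0, 1000, 10000, 143000]  # illustrative subset

for step in steps:
    model = GPTNeoXForCausalLM.from_pretrained(
        "EleutherAI/pythia-70m-deduped", revision=f"step{step}"
    )
    # Flatten all parameters into one vector for distributional analysis.
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    print(step, flat.mean().item(), flat.std().item(), flat.abs().max().item())
```

Most of the wall-clock cost is in the repeated `from_pretrained` calls, which is exactly what scales badly for the larger models.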
I am really surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from”, but apparently we can’t do that easily. I suppose we could use an MCMC sampler?
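For concreteness, a minimal sketch of that idea: fit a kernel-density estimate to the empirical weights and draw new initialization values via random-walk Metropolis-Hastings (the `empirical_weights` array and step size are placeholders, not anything from the post):

```python
# Minimal sketch, assuming the empirical weight values are a 1-D array:
# fit a KDE to them, then sample via random-walk Metropolis-Hastings.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
empirical_weights = rng.standard_t(df=5, size=10_000)  # placeholder data
kde = gaussian_kde(empirical_weights)  # target density (up to normalization)

def mh_sample(n_samples, step_size=0.1):
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step_size * rng.standard_normal()
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < kde(proposal)[0] / kde(x)[0]:
            x = proposal
        samples[i] = x
    return samples

new_init = mh_sample(5_000)
```

(In one dimension, inverse-transform sampling from the empirical CDF would be even simpler than MCMC, but the sketch above generalizes more directly.)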
I agree the distribution thing is weird and not what I was expecting. I have so far tried fitting Gaussian, power-law, and logistic distributions, and none are super close in general. I have also tried general fits to generalised exponential functions of the form exp(kx^\alpha), where k and \alpha are free parameters, but this optimization tends to be numerically unstable and give bad results whenever I have tried it. Other people at Conjecture, following the PDLT book, have tried fitting the fourth-order perturbative expansion, i.e. exp(x^2 + \gamma x^4), which also runs into numerical issues.
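For what it’s worth, scipy’s generalized normal distribution (`scipy.stats.gennorm`, with density proportional to exp(-|x/s|^beta)) covers the symmetric version of the exp(kx^\alpha) family, and its built-in MLE fit is sometimes more stable than a hand-rolled curve fit. A minimal sketch on placeholder data:

```python
# Minimal sketch: fit the generalized normal family p(x) ~ exp(-|x/s|^beta),
# the symmetric version of the exp(k x^alpha) fits described above.
# The data here is a placeholder, not the actual Pythia weights.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.laplace(size=10_000)  # placeholder: heavier-than-Gaussian tails

beta, loc, scale = stats.gennorm.fit(weights)
print(f"fitted shape beta = {beta:.3f} (beta=2 is Gaussian, beta=1 is Laplace)")

# Goodness of fit via Kolmogorov-Smirnov against the fitted distribution.
ks = stats.kstest(weights, "gennorm", args=(beta, loc, scale))
print(f"KS statistic = {ks.statistic:.4f}, p = {ks.pvalue:.3f}")
```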
I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the [Tensor Programs](https://arxiv.org/abs/2203.03466) work by Greg Yang et al. that led to muP.
Maybe? I haven’t studied Tensor Programs in extreme detail, but my understanding is that they assume Gaussian limits for their proofs. However, afaik muP does work in practice, so maybe this isn’t such a big deal?
To clarify what’s going on with the Pythia models:
This is great to have clarified, thanks! I’ll tone down the disclaimer then and add the note about the new nomenclature.
Have you tried fitting a Student’s t distribution? The nice thing about that distribution is that the nu parameter completely controls the shape of the tails, and it recovers the Gaussian in the limit as nu goes to infinity; this would let you plot a cool graph of nu against checkpoint steps, giving an easy visualisation of exactly how the shape of the tails changes over time.
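A minimal sketch of what that plot could look like, using scipy’s built-in Student’s t fit on placeholder per-checkpoint weight arrays (`weights_by_step` is a stand-in for the real data):

```python
# Minimal sketch: fit Student's t at each checkpoint and plot the fitted
# degrees-of-freedom nu against training step. The data is a placeholder
# standing in for the real per-checkpoint weight vectors.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
steps = [0, 1000, 10_000, 50_000, 143_000]
# Stand-in data where tails grow heavier over training (nu shrinking).
weights_by_step = {s: rng.standard_t(df=30 / (i + 1), size=20_000)
                   for i, s in enumerate(steps)}

nus = []
for s in steps:
    nu, loc, scale = stats.t.fit(weights_by_step[s])
    nus.append(nu)

plt.plot(steps, nus, marker="o")
plt.xlabel("checkpoint step")
plt.ylabel(r"fitted $\nu$ (tail shape; larger = more Gaussian)")
plt.show()
```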