This is really exciting work to see, and exactly the kind of thing I was hoping people would do when designing the Pythia model suite. It looks like you’re experimenting with the 5 smallest models, but haven’t done analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning on adding, or no?
I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from” but apparently we can’t do that easily. I suppose we could run an MCMC sampler? In general, it seems like a natural follow-up to the contents of this post is to change the way we initialize things in models, retrain them, and see what happens (esp. with the loss curve). If that’s something you’d like to collaborate with EleutherAI on, I would be more than happy to arrange something :)
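For what it’s worth, a minimal sketch of the MCMC idea, with a small Metropolis sampler targeting an empirical density built from a trained model’s flattened weights (purely illustrative and untuned; in one dimension, directly resampling from the empirical distribution would work just as well):

```python
# Sketch: a Metropolis sampler whose target is a Gaussian KDE fit to the
# flattened weights of a trained model. Everything here is illustrative.
import numpy as np
from scipy.stats import gaussian_kde

def metropolis_sample(trained_weights, n_samples, step=0.02, seed=0):
    rng = np.random.default_rng(seed)
    density = gaussian_kde(trained_weights)   # target density estimate
    x = float(np.median(trained_weights))     # start from a high-density point
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + rng.normal(scale=step)
        accept_prob = density(proposal).item() / density(x).item()
        if rng.random() < accept_prob:        # Metropolis accept/reject
            x = proposal
        samples[i] = x
    return samples

# usage (hypothetical): new_init = metropolis_sample(flat_trained_weights, 10_000)
```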
In general, the reliability of the things you’re seeing across model scales is really cool. I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the Tensor Programs work by Greg Yang et al. that led to muP.
To clarify what’s going on with the Pythia models:
This work appears to be using the initial model release, which has an inconsistent naming scheme. Some models were named based on total parameters, while others were named based on the number of learnable parameters. The former is the convention models are typically named by, while the latter is what people put on the x-axis of scaling-law plots. This is a nomenclature change only, with no impact on results.
Shortly after release, we renamed the models to be consistently named using the total number of parameters. The models studied in this post are currently named 70M, 160M, 410M, 1B, and 1.4B.
When writing the paper for these models, we discovered a handful of inconsistencies in the suite’s hyperparameters. Specifically, the batch size and some all-reduce optimizations were inconsistent across training. We expect this to have no impact on the OP or on 90% of experiments using the suite. That said, if we’re going to spend all this compute to design a suite for controlled scientific experiments, it should control for as many factors as possible. The current models will remain public, and people are encouraged to compare results across them to further validate that these inconsistencies don’t impact the behaviors they’re finding.
The Pythia models are an amazing resource. This is a great tool and great work.
One experiment that could perhaps help disentangle idiosyncrasies from robust behaviors would be to run these experiments with a pair of seeds for each model size. With the currently trained models, this could perhaps just involve plotting the exact same curves for the “deduplicated” versus “non-deduplicated” models, since dataset deduplication likely has a limited impact on the model-averaged training dynamics of the weights investigated here (there are obviously countless more experiments that could be added, but this one is maybe an easy one).
Good idea—will run this experiment!
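A rough sketch of what that comparison could look like with the released checkpoints (the model names and the “step3000” revision tag are assumed to follow the Hugging Face Hub layout for the suite and should be double-checked):

```python
# Sketch: compare summary statistics of the deduplicated vs non-deduplicated
# Pythia-70M weights at the same checkpoint. Model names and the "step<N>"
# revision tags are assumptions about the Hub layout for the Pythia suite.
import torch
from transformers import GPTNeoXForCausalLM

def weight_stats(model_name, revision):
    model = GPTNeoXForCausalLM.from_pretrained(model_name, revision=revision)
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    z = (flat - flat.mean()) / flat.std()
    return {"mean": flat.mean().item(),
            "std": flat.std().item(),
            "kurtosis": (z ** 4).mean().item()}

for name in ["EleutherAI/pythia-70m", "EleutherAI/pythia-70m-deduped"]:
    print(name, weight_stats(name, revision="step3000"))
```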
I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from” but apparently we can’t do that easily.
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?” Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of what is actually training/‘doing the work’ (which is still going on under the hood & just no longer visible in some crude summary statistics); or it will train much much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct with all the theoretical implications thereof.
Why doesn’t Pythia let you do that? Sure, perhaps they aren’t exactly a logistic or familiar power law or a convenient parametric function, but if you want to replicate the initialization distribution elsewhere, just do something nonparametrically like sample from a histogram/cdf or permute the parameters from a finished model, and then maybe train on some equivalent heldout text dataset to reduce any lottery-ticket weirdness. (Verify it does anything useful/interesting, and it won’t be hard to find some flexible parametric distribution you can sample from if you need to; if there’s one thing I know about the exponential family of distributions, it’s that it has an awful lot of wiggly bois you’ve never heard of.)
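For concreteness, a minimal sketch of both nonparametric options in PyTorch, assuming `fresh` and `trained` are two architecturally identical modules (all names here are illustrative):

```python
# Sketch of the two nonparametric re-initialization ideas: (a) permute the
# weights of a finished model, or (b) draw fresh values from the empirical
# distribution of each trained tensor via inverse-CDF sampling.
import torch

@torch.no_grad()
def init_by_permutation(fresh, trained):
    for p_new, p_old in zip(fresh.parameters(), trained.parameters()):
        perm = torch.randperm(p_old.numel())
        p_new.copy_(p_old.flatten()[perm].view_as(p_new))

@torch.no_grad()
def init_by_empirical_cdf(fresh, trained):
    for p_new, p_old in zip(fresh.parameters(), trained.parameters()):
        sorted_vals, _ = p_old.flatten().sort()        # empirical quantiles
        u = torch.rand(p_new.numel())                  # uniforms in [0, 1)
        idx = (u * (sorted_vals.numel() - 1)).long()   # inverse-CDF lookup
        p_new.copy_(sorted_vals[idx].view_as(p_new))
```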
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?”
It might show this. As far as I know nobody has done this experiment. Either way results would be interesting.
Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of what is actually training/‘doing the work’ (which is still going on under the hood & just no longer visible in some crude summary statistics); or it will train much much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct with all the theoretical implications thereof.
My intuition/prediction here (which I hold with fairly low confidence) is that if you initialised the model in this way, the early part of training would be sped up, because it seems that a lot of what the model is doing in the first few steps of training, when the loss is rapidly decreasing, is just broadly fitting the general scale and distributional shape of the data distribution. We see that the model roughly reaches something like its final distribution quite rapidly, and I expect the second, much longer phase to be where most of the important representation formation happens, which won’t be much affected by initialising in this way. So basically we would get a slight speed boost / cut out a number of the early steps of training, but the benefits would diminish after the earliest phase of training. Would definitely be interesting to look at more systematically.
just do something nonparametrically like sample from a histogram/cdf or permute the parameters from a finished model, and then maybe train on some equivalent heldout text dataset to reduce any lottery-ticket weirdness
These are both good ideas.
Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup by cutting out the early steps if it’s basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.
Perhaps more interesting are the consequences for the training setup and architecture: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that would otherwise cause divergence. I’ve long been impressed by how it can be possible to remove normalization entirely or train stable vanilla NNs 10,000 layers deep just by improving the initialization/distribution. Reverse-engineering the final distribution may be a helpful method here: if you initialize from it, you may be able to drop some complexity from the overall Transformer recipe.
Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup by cutting out the early steps if it’s basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
Perhaps more interesting are the consequences for the training setup and architecture: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that would otherwise cause divergence
Yes. I feel that this might help especially with warmup, which could plausibly exist because at the start there are very large and mostly non-informative gradients pushing the weights toward the right distribution; these would be removed if you start out at the right distribution.
Also, I meant to ask you: what does the learning rate schedule of these models look like? In a lot of the summary-statistics plots we see peaks, asymptotes, and sometimes clear phase transitions between checkpoints 20 and 40, and I was wondering if this is related to the learning rate schedule somehow (end of warmup?)
Linear warm-up over the first 10% of training, then cosine decay to a minimum of one-tenth the peak LR, with the minimum reached at the end of training (300B tokens). Peak LRs vary by model but are roughly consistent with GPT-3 and OPT values. You can find all the config details on GitHub. The main divergence from mainstream approaches that is relevant to this conversation is that we use a constant batch size (2M tokens) throughout scaling. Prior work uses batch sizes up to 10x smaller for the smallest models, but we find that we can train small models at large batch sizes without any problems. This enables us to achieve a substantial wall-clock speed-up for small models by throwing more GPUs at them. We continue to use this batch size for the 11B model for consistency, although the standard progression of batch sizes would suggest 3M or 4M by that point.
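Written out, the schedule described above looks like the sketch below (the 143k total iterations figure is from later in this thread; the peak LR value is a placeholder, since the real values are per-model and live in the configs):

```python
# Linear warm-up over the first 10% of steps, then cosine decay so that the LR
# reaches 0.1 * peak exactly at the final step. peak_lr is a placeholder value.
import math

def lr_at(step, total_steps=143_000, peak_lr=2e-4):
    warmup_steps = int(0.1 * total_steps)
    min_lr = 0.1 * peak_lr
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)  # 0 at the peak, 1 at the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. the LR at 20k and 40k iterations (checkpoints 20 and 40):
print(lr_at(20_000), lr_at(40_000))
```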
Checkpoints 20 and 40 are at 20k and 40k iterations respectively, and the entire training run is 143k iterations. So they occur relatively shortly after the LR peaks, but they don’t coincide with anything I know to be particularly special.
It looks like you’re experimenting with the 5 smallest models, but haven’t done analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning on adding, or no?
We have done some preliminary analyses on these as well. The primary issue is just that these experiments take longer, since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). I am planning to run the same experiments on the larger models and update the post with them at some point, however.
I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from” but apparently we can’t do that easily. I suppose we could run an MCMC sampler?
I agree the distribution thing is weird and not what I was expecting. So far I have tried fitting Gaussian, power-law, and logistic distributions, and none are particularly close in general. I have also tried general fits to generalised exponential densities of the form exp(-k|x|^\alpha), where k and \alpha are free parameters, but this optimization tends to be numerically unstable and give bad results whenever I have tried it. Other people at Conjecture, following the PDLT book, have tried fitting the fourth-order perturbative expansion, i.e. a density proportional to exp(-(x^2 + \gamma x^4)), which also runs into numerical issues.
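One possibly more numerically stable route than hand-rolled fits of exp(-k|x|^\alpha) would be a maximum-likelihood fit of the equivalent generalized-normal (exponential-power) family via scipy; this is a sketch of that idea, not a record of what was actually tried:

```python
# Sketch: fit several candidate families by maximum likelihood and compare
# log-likelihoods. stats.gennorm is the exponential-power family with density
# proportional to exp(-|x/scale|^beta), i.e. the exp(-k|x|^alpha) form above.
import numpy as np
from scipy import stats

def compare_fits(weights):
    results = {}
    for name, dist in [("gaussian", stats.norm),
                       ("logistic", stats.logistic),
                       ("gennorm", stats.gennorm)]:
        params = dist.fit(weights)
        results[name] = (params, dist.logpdf(weights, *params).sum())
    return results

weights = np.random.standard_normal(100_000)  # placeholder; use flattened model weights
for name, (params, loglik) in compare_fits(weights).items():
    print(f"{name:10s} log-likelihood = {loglik:.1f}")
```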
I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the [Tensor Programs](https://arxiv.org/abs/2203.03466) work by Greg Yang et al. that led to muP.
Maybe? I haven’t studied Tensor Programs in extreme detail, but my understanding is that they assume Gaussian limits for their proofs. However, as far as I know muP does work in practice, so maybe this isn’t such a big deal?
To clarify what’s going on with the Pythia models:
This is great to have clarified, thanks! I’ll tone down the disclaimer, then, and add a note about the new nomenclature.
Have you tried fitting a Student’s t distribution? The nice thing about that distribution is that the nu parameter completely controls the shape of the tails, and it reduces to a Gaussian as nu goes to infinity; this would allow you to plot a cool graph of nu against checkpoint step to get an easy visualisation of exactly how the shape of the tails changes over time.
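Concretely, that could look something like the sketch below (the model name, the “step<N>” revision tags, and the subsampling are assumptions/illustrative choices):

```python
# Sketch: fit a Student's t to a subsample of the flattened weights at each
# checkpoint and track the fitted nu over training; nu -> infinity recovers a
# Gaussian, so the fitted nu summarizes how heavy the tails are.
import torch
from scipy import stats
from transformers import GPTNeoXForCausalLM

def fitted_nu(step, model_name="EleutherAI/pythia-70m", subsample=200_000):
    model = GPTNeoXForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    flat = flat[torch.randperm(flat.numel())[:subsample]]  # keep the fit fast
    nu, loc, scale = stats.t.fit(flat.numpy())
    return nu

# e.g. compute nu at a few checkpoints and plot it against training step:
steps = [1_000, 10_000, 40_000, 143_000]
print([(s, fitted_nu(s)) for s in steps])
```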