I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from” but apparently we can’t do that easily.
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?” Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of what is actually training/‘doing the work’ (which is still going on under the hood & just no longer visible in some crude summary statistics); or it will train much much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct with all the theoretical implications thereof.
Why doesn’t Pythia let you do that? Sure, perhaps the distributions aren’t exactly a logistic or familiar power law or a convenient parametric function, but if you want to replicate the final distribution as an initialization elsewhere, just do something nonparametrically like sample from a histogram/cdf or permute the parameters from a finished model, and then maybe train on some equivalent heldout text dataset to reduce any lottery-ticket weirdness. (Verify it does anything useful/interesting, and it won’t be hard to find some flexible parametric distribution you can sample from if you need to; if there’s one thing I know about the exponential family of distributions, it’s that it has an awful lot of wiggly bois you’ve never heard of.)
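To make the two nonparametric options concrete, here is a minimal sketch in plain Python. The names `permuted_init` and `empirical_cdf_init` are hypothetical, and `trained` is a synthetic stand-in for one flattened weight tensor from a finished model, not actual Pythia weights. In practice you would presumably do this per tensor, with array operations rather than lists.

```python
import random

def permuted_init(trained, rng):
    """Return the trained values in shuffled order.

    The histogram is preserved exactly; any learned arrangement is destroyed.
    """
    shuffled = list(trained)
    rng.shuffle(shuffled)
    return shuffled

def empirical_cdf_init(trained, n, rng):
    """Draw n fresh values by inverse-transform sampling from the empirical CDF.

    Values are linearly interpolated between neighbouring order statistics,
    so n does not have to match len(trained).
    """
    ordered = sorted(trained)
    last = len(ordered) - 1
    samples = []
    for _ in range(n):
        pos = rng.random() * last      # fractional rank between 0 and last
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        samples.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return samples

rng = random.Random(0)
# Synthetic heavy-tailed stand-in for a flattened trained tensor (an assumption).
trained = [rng.gauss(0.0, 0.02) * rng.expovariate(1.0) for _ in range(10_000)]

same_size_init = permuted_init(trained, rng)            # same values, new order
other_size_init = empirical_cdf_init(trained, 2_500, rng)  # new values, same shape of distribution
```

Permutation keeps the exact values but only works when the target tensor has the same number of entries. Sampling from the empirical CDF works for any size, which is what you would need to carry a distribution over to a differently sized model.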
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?”
It might show this. As far as I know nobody has done this experiment. Either way results would be interesting.
Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of what is actually training/‘doing the work’ (which is still going on under the hood & just no longer visible in some crude summary statistics); or it will train much much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct with all the theoretical implications thereof.
My intuition/prediction here (which I have fairly low confidence in) is that if you initialised it in this way, the early bit of training would be sped up, because it seems that a lot of what the model is doing in the first few steps of training, when loss is rapidly decreasing, is just broadly fitting the general scale and distributional shape of the data distribution. We see that the model roughly reaches something like its final distribution quite rapidly, and I expect the second, much longer phase that follows to be where most of the important representation formation happens, which won’t be much affected by initialising it in this way. So basically we would get a slight speed boost / cut out a number of early steps of training, but those benefits would diminish after the earliest phase of training. Would definitely be interesting to look at more systematically.
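One way to look at "reaches something like its final distribution quite rapidly" more systematically is to track a distance between each checkpoint's weight distribution and the final one. The sketch below uses the mean gap between matching quantiles, which approximates the 1-D Wasserstein-1 distance. The function names are hypothetical, and the two samples are synthetic stand-ins rather than real checkpoint weights.

```python
import random

def quantiles(values, k):
    """Return k evenly spaced empirical quantiles (nearest-rank), min to max."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / (k - 1))] for i in range(k)]

def quantile_distance(a, b, k=101):
    """Mean absolute gap between matching quantiles of two samples.

    Approximates the 1-D Wasserstein-1 distance; the samples may differ in size.
    """
    qa, qb = quantiles(a, k), quantiles(b, k)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / k

rng = random.Random(0)
# Synthetic stand-ins (assumptions): a Gaussian "init" and a heavier-tailed "final".
init_weights = [rng.gauss(0.0, 0.02) for _ in range(5_000)]
final_weights = [rng.gauss(0.0, 0.02) * rng.expovariate(1.0) for _ in range(5_000)]

print(quantile_distance(init_weights, final_weights))
```

Computed per tensor at each saved checkpoint against the final checkpoint, this gives a curve over training steps. If the prediction above is right, that curve should drop most of the way within the early phase, while the loss keeps improving long afterwards.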
just do something nonparametrically like sample from a histogram/cdf or permute the parameters from a finished model, and then maybe train on some equivalent heldout text dataset to reduce any lottery-ticket weirdness
Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup by cutting out the early steps if it’s basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.
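A possible way to put a number on "cutting out the early steps": given per-step loss curves for a standard-init run and a run started from the final distribution, ask how much later the baseline first reaches each loss the candidate achieves. This is only a sketch. `steps_saved` is a hypothetical helper, and the curves in the example are made-up numbers used purely to exercise the function.

```python
from bisect import bisect_left

def steps_saved(baseline_loss, candidate_loss):
    """For each candidate step t, return s - t, where s is the first baseline
    step whose running-best loss is at or below the candidate's loss at t.

    Positive means the candidate is ahead, negative means it is behind,
    and None means the baseline never gets that low.
    """
    # Running minimum makes the baseline curve non-increasing.
    best, running = [], float("inf")
    for loss in baseline_loss:
        running = min(running, loss)
        best.append(running)
    negated = [-x for x in best]  # non-decreasing, so bisect applies
    saved = []
    for t, loss in enumerate(candidate_loss):
        s = bisect_left(negated, -loss)  # first s with best[s] <= loss
        saved.append(s - t if s < len(best) else None)
    return saved

# Toy curves (made up): the candidate starts one step "ahead" and then plateaus.
print(steps_saved([4, 3, 2, 1], [3, 2, 1, 1]))  # -> [1, 1, 1, 0]
```

A roughly constant offset would fit the picture where the new initialization just skips some early steps, whereas an offset that keeps growing with training would suggest something more than a head start.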
Perhaps more interesting are the consequences for training and architecture: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that are trying to cause divergence. I’ve long been impressed by how it can be possible to remove normalization entirely or train stable vanilla NNs 10,000 layers deep just by improving the initialization/distribution. Reverse-engineering the final distribution may be a helpful method here: if you initialize from the final distribution, you may be able to drop some complexity from the overall Transformer recipe.
Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup by cutting out the early steps if it’s basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
Perhaps more interesting are the consequences for training and architecture: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that are trying to cause divergence
Yes. I feel that this might help especially with warmup, which could plausibly be needed just because at the start there are very large and mostly non-informative gradients towards just being the right distribution, and these would be removed if you start out at the right distribution.
These are both good ideas.