Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup from cutting out the early steps, if those steps are basically just wasting time/data/compute to fix the distributions. It might also converge to a different, and better, optimum.
Perhaps more interesting are the consequences for the training and the architecture: a lot of the Transformer recipe, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that would otherwise cause divergence. I’ve long been impressed by how it can be possible to remove normalization entirely or train stable vanilla NNs 10,000 layers deep just by improving the initialization/distribution. Reverse-engineering the final distribution may be a helpful method here. If you initialize from the final distribution, you may be able to drop some complexity from the overall Transformer recipe.
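To make that concrete, here is a minimal, purely hypothetical sketch of what ‘initializing from the final distribution’ could look like in PyTorch, assuming (crudely) that matching each weight matrix’s empirical mean and standard deviation is enough; the helper names are made up for illustration.

```python
# Minimal sketch (PyTorch, all names hypothetical): record the per-layer weight
# statistics of a trained Transformer and reuse them to initialize a fresh copy.
# This only matches each weight matrix's empirical mean/std, which is a crude
# assumption that ignores any higher-order or cross-layer structure in the
# final distribution.
import torch
import torch.nn as nn

def collect_layer_stats(trained_model: nn.Module):
    """Empirical (mean, std) of every weight matrix in a trained model."""
    stats = {}
    for name, param in trained_model.named_parameters():
        if param.dim() >= 2:  # weight matrices only; leave biases/norm gains alone
            stats[name] = (param.detach().mean().item(), param.detach().std().item())
    return stats

def init_from_stats(fresh_model: nn.Module, stats):
    """Re-initialize a fresh model by sampling each weight matrix i.i.d. Gaussian
    with the trained model's per-layer mean/std."""
    with torch.no_grad():
        for name, param in fresh_model.named_parameters():
            if name in stats:
                mean, std = stats[name]
                param.normal_(mean=mean, std=std)

# Usage, assuming `trained` and `fresh` are two instances of the same architecture:
#   init_from_stats(fresh, collect_layer_stats(trained))
```

Whether first and second moments alone capture enough of the final distribution to matter is exactly the kind of thing that would need testing.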
Yes, I guess I am overstating the possible speedup if I call it ‘much much faster’, but there ought to at least be a noticeable speedup from cutting out the early steps, if those steps are basically just wasting time/data/compute to fix the distributions. It might also converge to a different, and better, optimum.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
Perhaps more interesting are the consequences for the training and the architecture: a lot of the Transformer recipe, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that would otherwise cause divergence.
Yes. I feel that this might help especially with warmup, which could plausibly exist because at the start there are very large and mostly non-informative gradients pushing the weights towards just being the right distribution; those gradients would disappear if you start out at the right distribution already.
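One cheap way to test that hypothesis: compare the gradient norm on the very first batch under the default init versus a distribution-matched init. If the early gradients really are mostly ‘fix the distribution’ gradients, the matched init should start with a much smaller norm. A rough sketch, with hypothetical helper names:

```python
# Hypothetical diagnostic for the warmup hypothesis above: measure the gradient
# norm on the first batch under the default init vs. a distribution-matched init.
# `loss_fn(model, batch)` is an assumed helper that returns a scalar loss.
import torch

def first_step_grad_norm(model, loss_fn, batch):
    model.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()

# Usage sketch, with two identically-architected models:
#   g_default = first_step_grad_norm(model_default, loss_fn, batch)
#   g_matched = first_step_grad_norm(model_matched, loss_fn, batch)
#   print(g_default, g_matched)
```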