Relatedly: is “reverse distillation” (ie, generating a model with more parameters from a smaller one) possible for these big transformer models?
Yes. This goes by several names: hot-starting, warm initialization with model surgery à la OA5, slow weights vs fast weights / meta-learning, tied weights, etc. It’s also a fairly common idea in Neural Architecture Search, where you try to learn a small ‘cell’ or ‘module’ (either just the architecture or the weights as well) cheaply and then stack a bunch of them to get your final SOTA model; the two can also be combined, eg. SMASH. An example of using this to train very large models is “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021. It seems appealing but raises questions about efficiency & bias. Are you really still on the same scaling curve as the ‘true’ large model, given that the smaller model you are training almost by definition has a different (worse) scaling curve? And might you not be sabotaging your final model by hardwiring the weaknesses of the small initial model into it, rendering the approach penny-wise pound-foolish?
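To make the "model surgery" idea concrete, here is a minimal sketch of one way to grow a network without changing what it computes: widen a hidden layer by duplicating units and splitting their outgoing weights among the copies (in the spirit of Net2Net-style widening). This is a toy bias-free two-layer ReLU MLP in plain Python, not the procedure used by OA5 or M6-10T; the names `widen`, `mlp`, and the layer sizes are all made up for illustration.

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    # Two-layer MLP without biases: W2 @ relu(W1 @ x).
    return matvec(W2, relu(matvec(W1, x)))

def widen(W1, W2, new_width, rng):
    """Grow the hidden layer to new_width units while preserving the function.

    W1 has one row per hidden unit; W2 has one column per hidden unit.
    """
    old_width = len(W1)
    # Keep every old unit, then fill the extra slots with copies of
    # randomly chosen old units.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    counts = [mapping.count(j) for j in range(old_width)]
    # Copies inherit the incoming weights of the unit they duplicate...
    W1_big = [list(W1[j]) for j in mapping]
    # ...and share its outgoing weight equally, so the sum is unchanged.
    W2_big = [[row[j] / counts[j] for j in mapping] for row in W2]
    return W1_big, W2_big

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 3, 2  # hypothetical toy sizes
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]

W1_big, W2_big = widen(W1, W2, new_width=8, rng=rng)

x = [rng.gauss(0, 1) for _ in range(d_in)]
small_out = mlp(W1, W2, x)
big_out = mlp(W1_big, W2_big, x)
# The outputs should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(small_out, big_out))
```

The larger model starts exactly where the small one left off, which is the appeal. It also illustrates the worry: the duplicated units are initially identical, so without some symmetry-breaking noise and further training the big model has no more effective capacity than the small one.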