That got people to, I dunno, 6 layers instead of 3 layers or something? But it focused attention on vanishing/exploding gradients as the reason deeply layered neural nets never worked, and that kicked off the entire modern field of deep learning, more or less.
This might be a chicken or egg thing. We couldn’t train big neural networks until we could initialize them correctly, but we also couldn’t train them until we had hardware that wasn’t embarrassing / benchmark datasets that were nontrivial.
While we figured out empirical init strategies fairly early, like Glorot init in 2010, it wasn't until much later that we developed initialization schemes that really Just Worked (He init in 2015, Dynamical Isometry from Xiao et al. 2018).
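To make the difference concrete, here's a minimal NumPy sketch (the helper names, layer width, and depth are mine, purely for illustration): Glorot scales the weight variance by the average of fan-in and fan-out, while He scales by fan-in with an extra factor of 2 to account for ReLU zeroing half the activations.

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng):
    # Glorot/Xavier (2010): variance ~ 2 / (fan_in + fan_out),
    # derived assuming roughly linear (tanh-like) activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # He (2015): variance ~ 2 / fan_in, with the factor of 2
    # compensating for ReLU killing half the activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def activation_scale(init_fn, width=512, depth=50):
    # Push random inputs through a deep stack of ReLU layers and
    # check whether the activation scale stays roughly constant.
    rng = np.random.default_rng(42)
    x = rng.normal(size=(1000, width))
    for _ in range(depth):
        x = np.maximum(0.0, x @ init_fn(width, width, rng))
    return x.std()

print("Glorot after 50 ReLU layers:", activation_scale(glorot_init))
print("He     after 50 ReLU layers:", activation_scale(he_init))
```

Running this, the Glorot-initialized stack's activation scale collapses toward zero after ~50 ReLU layers while the He-initialized one stays near 1, which is roughly what "Just Worked" means here.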
If I had to blame something, I'd blame GPUs and custom kernel writing getting to the point that small research labs could begin to tinker with ~few-million-parameter models on essentially single machines + a few GPUs. (The AlexNet model from 2012 was only 60 million parameters!)