Consider first the more basic question: why is simple SGD on over-parameterized ANNs an effective global optimizer? This is the first great mystery of ANNs from the perspective of classical ML theory: they should get stuck in poor local minima and/or overfit, but in practice they generally don't (given a few tweaks) and simply work better and better with scale. Most other techniques don't have this property.
A large, over-parameterized ANN can encode not just a single circuit solution but an entire ensemble of candidate circuits (which dropout makes more explicit). SGD then explores this whole ensemble of solutions in parallel, automatically reusing all shared subcomputations, which yields an exponential speedup over explicitly evaluating every candidate individually (as most program search methods do).
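A minimal sketch of that ensemble view in plain NumPy (the two-layer toy network, drop rate, and layer sizes are illustrative assumptions, not from the text): each dropout forward pass samples a binary mask and so evaluates one randomly chosen sub-circuit of the full network, while every sub-circuit shares the same weight matrices and therefore the same gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer network; the sizes are arbitrary assumptions.
W1 = rng.normal(size=(32, 16)) * 0.1
W2 = rng.normal(size=(16, 1)) * 0.1

def forward(x, drop_p=0.5, train=True):
    """One forward pass. In training mode, dropout evaluates a single randomly
    sampled sub-circuit of the full network; W1 and W2 are shared across every
    such sub-circuit, so updating them updates the whole ensemble at once."""
    h = np.maximum(x @ W1, 0.0)               # shared hidden features (ReLU)
    if train:
        mask = rng.random(h.shape) > drop_p   # sample which units this sub-circuit keeps
        h = h * mask / (1.0 - drop_p)         # inverted-dropout scaling
    return h @ W2

x = rng.normal(size=(4, 32))
y_a, y_b = forward(x), forward(x)   # two training passes: two different sub-circuits
y_mean = forward(x, train=False)    # inference approximately averages over the ensemble
```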
Pruning and lottery tickets then follow naturally: after training the ensemble you can always extract one or more of the sparser, best candidate circuits (a sketch follows below). It's also well known that trained networks can be compressed and quantized to very low bit precision, so the first bit is by far the most important; retaining just that bit and retraining the rest (masking) should remove most of the retraining work.
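Here is a sketch of that pruning/masking idea in NumPy, under assumed names and a made-up keep ratio: take a trained weight tensor, keep only the largest-magnitude fraction as the winning-ticket mask, and retain only the sign (the first, most important bit) of the surviving weights as the starting point for retraining.

```python
import numpy as np

def ticket_mask(w, keep_frac=0.2):
    """Binary mask keeping the top keep_frac of weights by magnitude
    (a simple stand-in for lottery-ticket style pruning)."""
    k = max(1, int(keep_frac * w.size))
    threshold = np.partition(np.abs(w).ravel(), -k)[-k]
    return (np.abs(w) >= threshold).astype(w.dtype)

def sign_bit_init(w, mask, scale=0.05):
    """Keep only the sign ('first bit') of the surviving weights at a fixed
    magnitude; the remaining precision is left to be re-learned."""
    return np.sign(w) * scale * mask

rng = np.random.default_rng(0)
w_trained = rng.normal(size=(64, 64))          # stand-in for a trained layer
mask = ticket_mask(w_trained, keep_frac=0.2)   # extract a sparse candidate circuit
w_restart = sign_bit_init(w_trained, mask)     # retrain from here with the mask fixed
```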
SGD also has an obvious inductive bias, simply because it updates the weights along dL/dW, i.e. the direction that maximizes loss reduction per unit of weight change. That is not the direction that moves most directly toward the lowest-loss region (higher-order optimizers find that direction); SGD instead moves in the direction that maximizes loss reduction per bit/unit of weight change, and thus per unit of complexity gained.
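As a worked contrast (standard optimization identities, not anything specific to this post): the SGD step follows the raw gradient, which is the steepest-descent direction per unit Euclidean weight change, whereas a second-order (Newton) step rescales that direction by the inverse Hessian to point more directly at the nearby minimum.

```latex
% First-order (SGD) step: the raw gradient is the steepest-descent direction,
% i.e. the most loss reduction per unit (Euclidean) change in the weights.
\[
\Delta W_{\mathrm{SGD}} = -\,\eta\,\nabla_W L,
\qquad
-\frac{\nabla_W L}{\lVert \nabla_W L \rVert}
  = \operatorname*{arg\,min}_{\lVert d \rVert = 1} d^{\top} \nabla_W L .
\]

% Second-order (Newton) step: rescales by the inverse Hessian to aim more
% directly at the nearby minimum, at the cost of larger weight moves.
\[
\Delta W_{\mathrm{Newton}} = -\,\eta\, H^{-1} \nabla_W L,
\qquad
H = \nabla_W^{2} L .
\]
```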