I haven’t walked through your math carefully, but I find this type of analysis interesting.
SGD is believed to have a certain “bias” towards low-entropy models of the world. Part of this is a preference for “broader” rather than “narrower” minima of L. Now we have some tools which may allow us to understand this. Under this model, SGD is also biased towards regions of low variance in the loss function.
This bias towards regions of low variance makes intuitive sense.
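As a sanity check on that intuition, here is a toy construction of my own (not from your math): a one-dimensional double-well loss with two minima of identical depth and curvature, where the per-sample gradients disagree strongly around the left minimum and hardly at all around the right one. Full-batch gradient descent treats the two basins symmetrically, but minibatch SGD keeps getting kicked out of the high-variance basin and settles in the low-variance one. All constants are arbitrary; the only asymmetry is the sample-to-sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def full_grad(x):
    # Gradient of the double-well loss L(x) = (x**2 - 1)**2 / 4:
    # two minima of identical depth and curvature at x = -1 and x = +1.
    return x**3 - x

def grad_noise_std(x):
    # Per-sample gradient spread: large in the left basin, small in the right one.
    # This is the only asymmetry; the mean loss itself is perfectly symmetric.
    return 4.0 / (1.0 + np.exp(4.0 * x)) + 0.05

lr, batch, steps = 0.05, 4, 20_000

# 50 runs starting in each basin. Full-batch GD would leave every run where it started.
x = np.concatenate([np.full(50, -1.0), np.full(50, 1.0)])
for _ in range(steps):
    noise = rng.normal(size=x.shape) * grad_noise_std(x) / np.sqrt(batch)
    x -= lr * (full_grad(x) + noise)

print("fraction of runs ending in the low-variance basin (x > 0):", (x > 0).mean())
```

With these settings the printed fraction should come out at or near 1.0: runs starting on the high-variance side diffuse over the barrier and then stay put, while the low-variance side is too quiet to escape from.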
SGD’s bias towards low-entropy models also has a simple explanation: good inits start it in a low-entropy configuration, and SGD moves in an entropy-efficient direction, maximizing loss decrease per unit of weight change, which biases it strongly towards staying near the low-entropy init. This becomes quite noticeable when you experiment with second-order optimizers, which generally don’t have this bias: they tend to overfit far more easily and need more explicit regularization.
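To make that concrete, here is another small sketch of my own (again, not from your analysis): ill-conditioned linear regression with noisy labels, comparing plain gradient descent against a single damped Newton step. Gradient descent with a fixed step budget moves along the sharp directions but barely leaves the init along the flat ones, which acts like implicit ridge regularization; the Newton step jumps straight to the empirical minimizer, lands much farther from the init, and fits the label noise along the flat directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-conditioned regression: feature scales span two orders of magnitude,
# so the quadratic loss has both sharp and nearly flat directions.
n_train, n_test, d = 100, 2000, 80
scales = 10.0 ** (-2.0 * np.arange(d) / (d - 1))        # feature scales from 1.0 down to 0.01
X_train = rng.normal(size=(n_train, d)) * scales
X_test = rng.normal(size=(n_test, d)) * scales
w_true = rng.normal(size=d)
y_train = X_train @ w_true + rng.normal(size=n_train)    # unit-variance label noise
y_test = X_test @ w_true + rng.normal(size=n_test)

def grad(w):
    return X_train.T @ (X_train @ w - y_train) / n_train

H = X_train.T @ X_train / n_train                        # Hessian of the quadratic loss
w0 = np.zeros(d)

# Plain gradient descent with a fixed step budget: flat directions barely move from the init.
lr = 1.0 / np.linalg.eigvalsh(H).max()
w_gd = w0.copy()
for _ in range(1000):
    w_gd -= lr * grad(w_gd)

# A damped Newton step: the loss is quadratic, so one step lands on the empirical
# minimizer regardless of how flat a direction is, far from the init and fitting the noise.
w_newton = w0 - np.linalg.solve(H + 1e-8 * np.eye(d), grad(w0))

for name, w in [("gradient descent", w_gd), ("Newton step", w_newton)]:
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"{name:16s}  ||w - w0|| = {np.linalg.norm(w - w0):8.2f}   test MSE = {test_mse:.2f}")
```

On a typical draw the Newton solution should land much farther from w0 and generalize worse; the particular constants are arbitrary, but that gap is the point.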