I don’t think I understand your model of why neural networks are so effective. It sounds like you’re saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible, but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.
Yeah, that’s basically my model. How it regularizes, I don’t know. Perhaps the volume of “simple” functions is the main driver, rather than the gradient descent dynamics. I think the randomness is important; full-gradient descent (no stochasticity) would not work nearly as well.
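A minimal sketch of the stochasticity being referred to here, on a hypothetical toy regression problem (the data and batch sizes are illustrative, not from the conversation): the minibatch gradient that SGD actually follows is the full-batch gradient plus noise, and the noise shrinks as the batch size grows. Full-batch gradient descent is the zero-noise limit.

```python
# Sketch (numpy only, toy data): measure how far minibatch gradients
# scatter around the full-batch gradient at a fixed parameter point.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression problem: y = X w* + noise.
n, d = 1024, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)  # current parameters (an arbitrary point)

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)  # what full-batch GD would use at this point

for batch_size in (8, 64, 512):
    # Sample many minibatch gradients and measure their deviation
    # from the full-batch gradient.
    deviations = []
    for _ in range(200):
        idx = rng.choice(n, size=batch_size, replace=False)
        deviations.append(np.linalg.norm(grad(X[idx], y[idx], w) - full_grad))
    print(f"batch={batch_size:4d}  mean ||minibatch grad - full grad|| "
          f"= {np.mean(deviations):.4f}")
# The deviation shrinks roughly like 1/sqrt(batch_size): minibatching
# injects batch-size-dependent noise that full-batch GD does not have.
```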
Oh, this reminded me of the temperature component of SLT, which I believe modulates how sharply one samples from the Bayesian posterior, or perhaps how heavily to update on new evidence; I forget which. In any case, it does this to try to capture the stochasticity of SGD. How successfully it does so is still an open problem, I believe.
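For reference, the object I think is being gestured at here is singular learning theory’s tempered Bayesian posterior, where an inverse temperature β rescales the likelihood. The notation below (prior φ, dataset D_n) is the standard textbook one and is my addition, not the speaker’s:

```latex
% Tempered posterior at inverse temperature \beta = 1/T (standard SLT setup;
% notation assumed here, not taken from the conversation):
\[
  p_\beta(w \mid D_n)
    = \frac{1}{Z_n(\beta)}\,\varphi(w)\prod_{i=1}^{n} p(x_i \mid w)^{\beta},
  \qquad
  Z_n(\beta) = \int \varphi(w)\prod_{i=1}^{n} p(x_i \mid w)^{\beta}\,dw .
\]
% \beta = 1 recovers the ordinary Bayesian posterior; larger \beta concentrates
% the posterior more sharply on high-likelihood parameters, while smaller \beta
% flattens it. This \beta is the knob meant to stand in for the noise level of
% SGD discussed above.
```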