This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in one paragraph on page 3. They say that the literature arguing that NNs work well because they're Bayesian is related (which is true—it's also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
In brief: In weight space, uniform. In function space, it's an open problem, and the paper says relatively little about that. It only shows that conditioning on functions achieving zero loss, and weighting each one by its corresponding volume in weight space, gets you the same result as training a neural network. The former process is sampling from a Bayesian posterior.
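To make that concrete, here's a hedged toy sketch (my own illustration, not code from the paper): brute-force rejection sampling on a tiny Boolean problem. Draw weights from a uniform prior, keep only draws that fit the training labels exactly (zero loss); functions occupying larger weight-space volume survive rejection more often, so the surviving draws are samples from the Bayesian posterior over functions. The network shape, width, and data are all made up for illustration.

```python
# Toy rejection-sampling sketch of "condition on zero loss, weight by
# weight-space volume = sample from a Bayesian posterior".
# All architecture/data choices here are illustrative assumptions.
import itertools
from collections import Counter

import numpy as np

rng = np.random.default_rng(1)
X = np.array(list(itertools.product([0.0, 1.0], repeat=2)))  # all 4 inputs
y_train = np.array([0, 1, 1])  # labels for the first 3 inputs only;
                               # the 4th input's label is unconstrained

def draw_truth_table(width=6):
    """Draw weights ~ Uniform(-1, 1) and return the induced Boolean function."""
    W1 = rng.uniform(-1, 1, size=(2, width))
    b1 = rng.uniform(-1, 1, size=width)
    W2 = rng.uniform(-1, 1, size=(width, 1))
    out = (np.maximum(X @ W1 + b1, 0.0) @ W2).ravel()  # one-hidden-layer ReLU net
    return (out > 0).astype(int)

posterior = Counter()
for _ in range(50000):
    f = draw_truth_table()
    if np.array_equal(f[:3], y_train):   # condition on zero training loss
        posterior[tuple(f)] += 1         # weight-space volume does the weighting

print("posterior over the consistent functions:", dict(posterior))
```

The two surviving truth tables differ only on the held-out 4th input, and their relative counts estimate the posterior probability each generalization receives under the uniform weight prior.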
Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is still lacking. That said, there are often many directions you can move in weight space that don't change the function at all, so one would expect it's a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
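You can see this compressiveness empirically in a toy setting (again my own sketch, not from the paper): sample weights of a tiny ReLU network uniformly and count how often each Boolean function on 3 inputs shows up. A uniform prior in weight space induces a very non-uniform prior over the 256 possible truth tables, with a few functions claiming most of the mass.

```python
# Illustration of the compressive weights-to-functions map: many uniform
# weight draws collapse onto the same function. Architecture and width
# are arbitrary assumptions for the demo.
import itertools
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # all 8 inputs

def sample_function(width=8):
    """Draw weights ~ Uniform(-1, 1) and return the induced truth table."""
    W1 = rng.uniform(-1, 1, size=(3, width))
    b1 = rng.uniform(-1, 1, size=width)
    W2 = rng.uniform(-1, 1, size=(width, 1))
    b2 = rng.uniform(-1, 1)
    h = np.maximum(X @ W1 + b1, 0.0)       # ReLU hidden layer
    out = (h @ W2).ravel() + b2
    return tuple((out > 0).astype(int))    # hashable truth table

counts = Counter(sample_function() for _ in range(20000))
print(f"{len(counts)} distinct functions from 20000 weight draws")
print("most frequent truth tables:", counts.most_common(3))
```

Far fewer distinct functions appear than weight draws, and the most frequent functions (typically the constant ones) get far more than a uniform 1/256 share—exactly the kind of simplicity-biased, compressive map the paper's volume argument relies on.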
I'll say more about your other comment later (maybe).
EDIT: Actually, there should be a term for the stochasticity, which you integrate into the SLT (singular learning theory) equations like you would temperature in a physical system. I don't remember exactly how this works, though, or whether the exact connection with SGD is even known.