Fundamentally, the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.
The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume. Saying that SGD is “Bayesian” is one way of saying the latter, and the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
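To make the "volume in parameter-space" idea concrete, here is a minimal sketch of random sampling over a toy network. Everything in it is an assumption chosen for illustration (the 3-bit Boolean input space, the tiny tanh architecture, the Gaussian parameter distribution, and the name `sample_function_counts`); it is not the paper's experimental setup. Each random parameter draw is mapped to the Boolean function it computes, so the empirical frequency of a function estimates its relative volume.

```python
import numpy as np
from collections import Counter
from itertools import product


def sample_function_counts(n_samples=20000, n_inputs=3, hidden=16, seed=0):
    """Draw random parameters for a tiny tanh network and count how often
    each Boolean function on {0,1}^n_inputs is realised.

    The count of a function is a Monte Carlo estimate of the (prior-weighted)
    parameter-space volume that maps to it. Architecture and parameter
    distribution are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # All 2^n_inputs Boolean inputs, one per row.
    X = np.array(list(product([0.0, 1.0], repeat=n_inputs)))
    counts = Counter()
    for _ in range(n_samples):
        W1 = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_inputs, hidden))
        b1 = rng.normal(0.0, 1.0, size=hidden)
        w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=hidden)
        b2 = rng.normal(0.0, 1.0)
        out = np.tanh(X @ W1 + b1) @ w2 + b2
        # Threshold the outputs to get the function as a bit string.
        key = "".join("1" if o > 0 else "0" for o in out)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    counts = sample_function_counts()
    total = sum(counts.values())
    # Print the functions that occupy the largest estimated volume.
    for fn, c in counts.most_common(5):
        print(fn, c / total)
```

Comparing these frequencies against how often a trained network lands on each function (conditioned on fitting the training set) would be the toy analogue of the comparison discussed in this thread.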
Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.
This has been done with real neural networks! See this, for example: they use Gaussian processes on architectures like MobileNetV2, DenseNet121, and ResNet50. It seems to work well.
One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high-dimensional settings, most of the volume is near the boundary, so this is not a big deal. I’m not aware of any work that claims SGD uniformly samples from this boundary, but it’s worth considering that possibility if the experimental results hold up.
We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.
The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.
This seems right, but I’m not sure how that’s different from Zach’s phrasing of the main point? Zach’s phrasing was “SGD approximately equals random sampling”, and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing you said.
(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
Saying that SGD is “Bayesian” is one way of saying the latter
This feels similar to: saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
I’ve discussed this above; I’ll continue the discussion there.
The rest of the comment is about stuff that I didn’t have a strong opinion on, so I’ll leave it for Zach to answer if he wants.
(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
I’m honestly not sure; I just wasn’t sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were “distractions from the main point”.
Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
Haha. That’s obviously not what we’re trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway.
I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
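One way to see the zero-initialisation point is a minimal sketch, under assumptions that are mine rather than the paper's: a one-hidden-layer tanh network, squared-error loss, and full-batch gradient descent standing in for SGD. The names `gradients`, `zero_params`, and `train` are hypothetical. At the origin every gradient except the output bias is exactly zero, so training can only ever reach a constant predictor, which is nothing like sampling from a distribution over functions.

```python
import numpy as np


def gradients(params, x, y):
    """Gradients of 0.5 * mean squared error for
    f(x) = tanh(x @ W1 + b1) @ w2 + b2, in the order (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = params
    h = np.tanh(x @ W1 + b1)          # hidden activations, shape (n, hidden)
    err = (h @ w2 + b2) - y           # residuals, shape (n,)
    n = len(y)
    # Backpropagate through the output layer and the tanh nonlinearity.
    dz = np.outer(err, w2) * (1.0 - h ** 2) / n
    return x.T @ dz, dz.sum(axis=0), h.T @ err / n, err.mean()


def zero_params(n_inputs, hidden):
    """All-zero initialisation (a point-mass 'prior' at the origin)."""
    return [np.zeros((n_inputs, hidden)), np.zeros(hidden), np.zeros(hidden), 0.0]


def train(params, x, y, lr=0.1, steps=200):
    """Plain full-batch gradient descent (assumption: used here instead of SGD)."""
    for _ in range(steps):
        grads = gradients(params, x, y)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 4))
    y = x[:, 0] + 1.0
    W1, b1, w2, b2 = train(zero_params(4, 8), x, y)
    # Only the output bias has moved; the network is a constant function.
    print(np.abs(W1).max(), np.abs(b1).max(), np.abs(w2).max(), b2)
```

The same argument applies to minibatch gradients in this toy setup, since each per-example gradient with respect to the hidden and output weights is zero at the origin.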
What I’m suggesting is that volume in high dimensions can concentrate on the boundary.
Yes. I imagine this is why overtraining doesn’t make a huge difference.
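The concentration-near-the-boundary claim can be checked with a one-line formula, sketched here under the simplifying assumption that the region is a unit ball (real zero-training-error regions are not balls, so this only illustrates the scaling). Since the volume of a ball of radius r scales as r^d, the fraction of volume within distance eps of the surface is 1 - (1 - eps)^d. The name `shell_fraction` is hypothetical.

```python
def shell_fraction(dim, eps):
    """Fraction of a unit dim-dimensional ball's volume that lies within
    distance eps of its surface (assumes 0 <= eps <= 1)."""
    return 1.0 - (1.0 - eps) ** dim


if __name__ == "__main__":
    # Even a thin 1% shell holds almost all the volume once dim is large.
    for d in (2, 10, 100, 1000, 10_000):
        print(d, shell_fraction(d, 0.01))
```

For parameter counts typical of neural networks, the interior accounts for a negligible share of the volume in this idealised picture, which is consistent with overtraining making little difference.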
Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.
I have a few comments on this:
I basically agree with what you say here.
Yes. I imagine this is why overtraining doesn’t make a huge difference.
See, e.g., page 47 in the main paper.