Zach’s summary for the Alignment Newsletter (just for the SGD as Bayesian sampler paper):
Neural networks have been shown empirically to generalize well in the overparameterized setting, which suggests that there is an inductive bias for the final learned function to be simple. The obvious next question: does this inductive bias come from the _architecture_ and _initialization_ of the neural network, or does it come from stochastic gradient descent (SGD)? This paper argues that it is primarily the former.
Specifically, if the inductive bias came from SGD, we would expect that bias to go away if we replaced SGD with random sampling. In random sampling, we repeatedly sample the network’s parameters from the initialization distribution and accept the first sample that achieves zero training error.
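To make this concrete, here is a minimal sketch of that rejection-sampling procedure on a toy problem. The tiny NumPy network, the dataset, and all names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, width=64):
    """Draw parameters from the initialization distribution (the 'prior')."""
    return (rng.normal(0, 1 / np.sqrt(d_in), (d_in, width)),
            rng.normal(0, 1 / np.sqrt(width), (width, 1)))

def predict(params, X):
    """Binary predictions of a one-hidden-layer ReLU net (threshold the output at 0)."""
    W1, W2 = params
    return ((np.maximum(X @ W1, 0.0) @ W2) >= 0).ravel()

def random_sample(X_train, y_train, max_tries=1_000_000):
    """'Random sampling': resample the initialization until it has zero training error."""
    for _ in range(max_tries):
        params = init_params(X_train.shape[1])
        if np.array_equal(predict(params, X_train), y_train):
            return params  # accepted: this draw fits the training set exactly
    raise RuntimeError("no zero-error sample found")

# Toy usage: 8 training points whose label is the sign of the first coordinate.
X = rng.normal(size=(8, 5))
y = X[:, 0] > 0
sampled = random_sample(X, y)
print(predict(sampled, X), y)
```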
The authors explore this hypothesis experimentally on the MNIST, Fashion-MNIST, and IMDb movie review datasets. They also test several variants of SGD, including Adam, Adagrad, and RMSprop. Since actually running rejection sampling for a dataset would take _way_ too much time, the authors approximate it using a Gaussian Process. This is known to be a good approximation in the large-width regime.
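As a rough illustration of what “approximating random sampling with a Gaussian Process” gives you, the sketch below does the conditioning by brute-force Monte Carlo on the same toy problem as above, with an RBF kernel standing in for the architecture-induced (NNGP) kernel; the paper uses a more scalable approximation rather than sampling, so treat this only as a picture of the quantity being estimated.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def rbf_kernel(A, B, lengthscale=1.0):
    """Stand-in kernel; the paper uses the kernel induced by the network architecture."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

def gp_posterior_over_functions(X_train, y_train, X_probe, n_samples=200_000):
    """Brute-force Monte-Carlo version of 'random sampling, approximated by a GP'.

    Draw latent functions from the zero-mean GP prior on train + probe inputs,
    keep only the draws that reproduce the training labels exactly, and tally
    the probe-set sign patterns of the survivors.
    """
    X_all = np.vstack([X_train, X_probe])
    K = rbf_kernel(X_all, X_all) + 1e-6 * np.eye(len(X_all))  # jitter for stability
    F = rng.multivariate_normal(np.zeros(len(X_all)), K, size=n_samples)
    labels = F >= 0
    fits = (labels[:, :len(X_train)] == y_train).all(axis=1)  # zero training error
    survivors = labels[fits, len(X_train):]
    return Counter(map(tuple, survivors)), fits.sum()

X = rng.normal(size=(8, 5)); y = X[:, 0] > 0
X_probe = rng.normal(size=(6, 5))
posterior, n_fit = gp_posterior_over_functions(X, y, X_probe)
for f, c in posterior.most_common(3):
    print(f, c / n_fit)   # approximate Bayesian probability of each function
```

The probabilities printed at the end play the role of the “Bayesian” probabilities that get compared against SGD’s empirical frequencies.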
Results show that the two probabilities are correlated across many orders of magnitude, for different architectures, datasets, and optimization methods. While the correlation isn’t perfect across all scales, it tends to improve as the frequency of the function increases. In particular, the top few most likely functions tend to have highly correlated probabilities under both generation mechanisms.
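For intuition about where the SGD-side probabilities come from, here is a hedged sketch of that half of the comparison on the same toy setup: train the same kind of small network many times from fresh random initializations, identify each learned function by its sign pattern on a few held-out inputs, and tally frequencies. The real experiments do this at much larger scale on the actual datasets.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def train_sgd(X, y, width=32, lr=0.1, epochs=200):
    """Train a one-hidden-layer ReLU net with plain SGD on the logistic loss."""
    d = X.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, width))
    W2 = rng.normal(0, 1 / np.sqrt(width), (width, 1))
    t = y.astype(float).reshape(-1, 1)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, ti = X[i:i + 1], t[i:i + 1]
            h = np.maximum(x @ W1, 0.0)
            p = 1.0 / (1.0 + np.exp(-(h @ W2)))        # sigmoid output
            g = p - ti                                 # gradient of the loss wrt the logit
            grad_W2 = h.T @ g
            grad_W1 = x.T @ ((g @ W2.T) * (h > 0))
            W1 -= lr * grad_W1
            W2 -= lr * grad_W2
    return W1, W2

def function_id(W1, W2, X_probe):
    """Identify the learned function by its sign pattern on some probe inputs."""
    return tuple(((np.maximum(X_probe @ W1, 0.0) @ W2) >= 0).ravel())

X = rng.normal(size=(8, 5)); y = X[:, 0] > 0   # toy training set
X_probe = rng.normal(size=(6, 5))              # held-out inputs that define "the function"

# Empirical SGD probabilities: frequency of each function over independent runs
# (the paper conditions on zero training error; runs that fail to reach it would be discarded).
counts = Counter(function_id(*train_sgd(X, y), X_probe) for _ in range(200))
print(counts.most_common(3))
```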
Zach’s opinion:
Fundamentally the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point. Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.
One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I’m not aware of any work that claims SGD uniformly samples from this boundary, but it’s worth considering that possibility if the experimental results hold up.
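A standard back-of-the-envelope calculation for the volume point, with a ball standing in for the zero-error region (a heuristic, not from the paper):

```latex
% Fraction of the volume of a d-dimensional ball of radius R that lies
% within distance eps of its surface:
\[
  \frac{V_d(R) - V_d(R - \varepsilon)}{V_d(R)}
  \;=\; 1 - \left(1 - \frac{\varepsilon}{R}\right)^{d}
  \;\longrightarrow\; 1 \quad \text{as } d \to \infty .
\]
% Example: d = 10^6 parameters and eps/R = 10^{-5} already give
% 1 - (1 - 10^{-5})^{10^6} \approx 1 - e^{-10} \approx 0.99995,
% so essentially all of the volume sits in a thin shell at the boundary.
```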
Rohin’s opinion:
I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like <@double descent@>(@Deep Double Descent@). I don’t think this is in conflict with the claim of the paper, which is that _most_ of the inductive bias comes from the architecture and initialization.
[Other](https://arxiv.org/abs/1805.08522) [work](https://arxiv.org/abs/1909.11522) by the same group provides some theoretical and empirical arguments that the neural network prior does have an inductive bias towards simplicity. I find those results suggestive but not conclusive, and am far more persuaded by the paper summarized here, so I don’t expect to summarize them.
I have a few comments on this:

> Fundamentally the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.
The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume. Saying that SGD is “Bayesian” is one way of saying the latter, and the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
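One way to write that claim down (a sketch, with P(θ) denoting the initialization distribution over parameters and D the training data):

```latex
% "Prior" probability of a function f: the volume of parameter space that
% implements f, weighted by the initialization distribution P(theta).
\[
  P(f) \;=\; \int \mathbf{1}\!\left[\,\theta \text{ implements } f\,\right] P(\theta)\, d\theta .
\]
% Conditioning on zero training error on D (a 0-1 likelihood) gives the
% posterior that exact random sampling would draw from:
\[
  P_B(f \mid D) \;=\;
  \frac{P(f)\,\mathbf{1}\!\left[\,f \text{ fits } D\,\right]}
       {\sum_{f'} P(f')\,\mathbf{1}\!\left[\,f' \text{ fits } D\,\right]} .
\]
% The claim is then that P_SGD(f | D), the probability that SGD run from a
% random initialization ends up at f, is approximately P_B(f | D).
```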
> Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.
This has been done with real neural networks! See this, for example—they use Gaussian Processes on stuff like Mobilenetv2, Densenet121, and Resnet50. It seems to work well.
> One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I’m not aware of any work that claims SGD uniformly samples from this boundary, but it’s worth considering that possibility if the experimental results hold up.
We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.
I basically agree with what you say here.

> The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.
This seems right, but I’m not sure how that’s different from Zach’s phrasing of the main point? Zach’s phrasing was “SGD approximately equals random sampling”, and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization and we get the thing you said.
(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
> Saying that SGD is “Bayesian” is one way of saying the latter

This feels similar to:

> Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.

(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
> the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
I’ve discussed this above; I’ll continue the discussion there.
The rest of the comment is about stuff that I didn’t have a strong opinion on, so I’ll leave it for Zach to answer if he wants.
> (Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
I’m honestly not sure; I just wasn’t really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were “distractions from the main point”.
> Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
>
> (This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
Haha. That’s obviously not what we’re trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway.
I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
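A toy illustration of that degenerate case (an illustrative sketch, not from the post): with every parameter initialized to zero, the weight gradients of a one-hidden-layer ReLU net vanish identically, so gradient descent never moves off the constant function, whatever the data say. (With biases included, only the output bias would move, which still leaves a constant function.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)

# All-zero initialization (no biases for simplicity): W1 is (5, 32), W2 is (32, 1).
W1, W2 = np.zeros((5, 32)), np.zeros((32, 1))

for _ in range(1000):
    h = np.maximum(X @ W1, 0.0)                # hidden activations: all zero
    p = 1 / (1 + np.exp(-(h @ W2)))            # outputs: all 0.5
    g = p - y                                  # loss gradient wrt the logits
    grad_W2 = h.T @ g                          # zero, because h is zero
    grad_W1 = X.T @ ((g @ W2.T) * (h > 0))     # zero, because W2 is zero
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2

print(np.abs(W1).max(), np.abs(W2).max())      # both still 0.0: gradient descent is stuck
```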
What I’m suggesting is that volume in high dimensions can concentrate on the boundary.
Yes. I imagine this is why overtraining doesn’t make a huge difference. See e.g., page 47 in the main paper.
Falsifiable Hypothesis: Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will become less likely under SGD with overtraining.