The results in Neural Networks Are Fundamentally Bayesian are pretty cool; it’s clever how they were able to estimate the densities.
A couple thoughts on the limitations:
There are various priors over functions for which we can calculate the exact posterior (e.g., Gaussian processes). However, Bayesian inference with these priors doesn’t perform as well as neural networks on most datasets. So knowing that SGD is Bayesian is only interesting if we also know that the prior is interesting. I think the ideal theoretical result would be to show that SGD on neural nets approximates Solomonoff Induction (or something like SI), and that the approximation gets better as the NNs get bigger and deeper. But I have yet to see any theory that connects neural nets/SGD to something like short programs.
If SGD works because it’s Bayesian, then making it more Bayesian should make it work better. But according to https://arxiv.org/abs/2002.02405 that’s not the case. Lowering the temperature, or taking the MAP estimate (the temperature → 0 limit), generalizes better than using the full Bayesian posterior (temperature 1), as approximated by an expensive MCMC procedure.
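To make “temperature” concrete, here is the tempered posterior as I understand the usual setup. The symbols (θ for the weights, D for the data, U for the posterior energy, T for the temperature) are my own notation for illustration and may not match the paper’s exactly:

```latex
\[
  p_T(\theta \mid D) \;\propto\; \exp\!\left(-\frac{U(\theta)}{T}\right),
  \qquad
  U(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta).
\]
```

At T = 1 this is the ordinary Bayes posterior, p(θ | D) ∝ p(D | θ) p(θ). For T < 1 the distribution is sharpened around low-energy weights, and as T → 0 it concentrates on the minimizers of U, which are the MAP estimates.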
We do have empirical data which shows that the neural network “prior” is biased towards low-complexity functions, and some arguments for why we should expect this to be the case (see this new blog post, and my comment here). Essentially, low-complexity functions correspond to larger volumes in the parameter space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would go a long way towards explaining why neural networks work.
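The volume claim can be probed at toy scale by sampling weights at random and counting how often each function shows up. The following is only a rough sketch under my own assumptions, not the setup from the linked post: a one-hidden-layer ReLU network on Boolean inputs, standard normal weights, and zlib-compressed length of the truth table as a crude stand-in for complexity. The function names and default sizes are hypothetical.

```python
import zlib
from collections import Counter
from itertools import product

import numpy as np


def sample_function_counts(n_inputs=5, hidden=16, n_samples=20000, seed=0):
    """Sample random weights and count which Boolean functions they implement.

    Each function is identified by its truth table, written as a bit string
    over all 2**n_inputs binary inputs.
    """
    rng = np.random.default_rng(seed)
    # All 2**n_inputs binary input vectors, one per row.
    inputs = np.array(list(product([0.0, 1.0], repeat=n_inputs)))
    counts = Counter()
    for _ in range(n_samples):
        # Assumption: i.i.d. standard normal weights and biases.
        w1 = rng.normal(size=(n_inputs, hidden))
        b1 = rng.normal(size=hidden)
        w2 = rng.normal(size=hidden)
        b2 = rng.normal()
        h = np.maximum(inputs @ w1 + b1, 0.0)  # ReLU hidden layer
        outputs = (h @ w2 + b2 > 0).astype(int)  # threshold the output unit
        counts["".join(map(str, outputs))] += 1
    return counts


def complexity_proxy(bits):
    """Crude complexity proxy: zlib-compressed length of the truth table."""
    return len(zlib.compress(bits.encode("ascii"), 9))


if __name__ == "__main__":
    counts = sample_function_counts()
    # Print the most frequently sampled functions with their proxy complexity.
    for bits, count in counts.most_common(10):
        print(bits, count / sum(counts.values()), complexity_proxy(bits))
```

The relative frequency of a truth table is a Monte Carlo estimate of the prior probability (parameter-space volume) of that function, so comparing it against the complexity proxy gives a rough picture of the bias. Compressed length is a weak proxy on strings this short, so any pattern it shows should be read as suggestive at best.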
I haven’t seen the paper you link, so I can’t comment on it specifically, but I want to note that the claim “SGD is roughly Bayesian” does not imply “Bayesian inference would give better generalisation than SGD”. It can simultaneously be the case that the neural network “prior” is biased towards low-complexity functions, that SGD roughly follows the “prior”, and that SGD provides some additional bias towards low-complexity functions (over and above what is provided by the “prior”). For example, if you look at Figure 6 in the post I link, you can see that different versions of SGD do provide a slightly different inductive bias. However, this effect seems to be quite small relative to what is provided by the “prior”.