Chris (author of the linked post) messaged me about these, and I also looked at the full paper and thought about it some more. There are still some stray pieces, but I’m now generally convinced the headline claim is correct: the vast majority of inductive bias comes from the implicit prior on the network’s parameter space.
Shortest path I see to that conclusion, from the figures, is a Fermi estimate:
- Figure 5b (in the post; Figure 1b in the paper) shows that functions with Bayesian posterior above ~10^-10 have typical generalization error on the order of 5% (hard to get more granular than that).
- Figure 5a shows that SGD typically chooses functions with Bayesian posterior above ~10^-10 (remember the x-axis is a cutoff here), and functions with higher Bayesian posterior are generally chosen more often by SGD.
- So, if the Bayesian posterior fully accounts for generalization error and SGD is not contributing any further inductive bias at all, we'd expect to see generalization error on the order of 5% from SGD.
- Looking at Figure 5d, we do indeed see generalization error on the order of 5% from SGD samples (the y-axis is not relevant to the estimate here).
This isn’t precise enough to rule out small contributions of SGD to the inductive bias, but it is pretty strong evidence that the Bayesian posterior is the main factor.
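To make the arithmetic explicit, here is the whole estimate as a few lines of Python. The numbers are order-of-magnitude placeholders read off the figures as described above, not exact values from the paper:

```python
# Fermi estimate: if the Bayesian posterior fully accounts for generalization
# and SGD adds no extra inductive bias, SGD's error should match the typical
# error of high-posterior functions.

posterior_cutoff = 1e-10      # Fig 5a: SGD mostly finds functions above this posterior
error_above_cutoff = 0.05     # Fig 5b: such functions have ~5% generalization error
observed_sgd_error = 0.05     # Fig 5d: SGD samples also show ~5% generalization error

predicted_sgd_error = error_above_cutoff  # prediction if the posterior explains everything
ratio = observed_sgd_error / predicted_sgd_error
print(f"predicted ~{predicted_sgd_error:.0%}, observed ~{observed_sgd_error:.0%} (ratio {ratio:.1f})")
# A ratio near 1 is consistent with the posterior being the main source of
# inductive bias; it cannot rule out small extra contributions from SGD.
```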
Other things I’m updating on, based on the post/paper:
- This makes it very likely that Gaussian processes, as a theoretical model, capture most of what makes DNNs work in practice.
- This makes it very likely that DNNs are “only doing interpolation”, in some sense, as opposed to extrapolation. (This already seemed fairly likely based on scaling curves, and the Gaussian process model gives us a second line of evidence.)
Yes, I agree with this.
I should note that SGD definitely does make some contribution to the inductive bias, but that contribution seems to be quite small compared to the one coming from the implicit prior embedded in the parameter-function map. For example, Figure 6 in the Towards Data Science post shows that different versions of SGD give slightly different inductive biases, and it’s well known that the inductive bias of neural networks is also affected by how long you train, what step size and batch size you use, and so on. However, these effects all seem small compared to the effect of the parameter-function map.
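To make “the prior embedded in the parameter-function map” concrete, here is a toy sketch (my own illustration, not the paper’s setup): sample random weights for a small network, read off the Boolean function it computes on all inputs, and look at how non-uniform the resulting distribution over functions is.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 2^3 = 8 inputs of a tiny 3-bit Boolean problem.
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def sample_function(width=16):
    """Sample from the prior over functions induced by the parameter-function map:
    draw random Gaussian weights for a one-hidden-layer tanh network and return
    the Boolean function it computes, encoded as an 8-character truth table."""
    W1 = rng.normal(size=(3, width))
    b1 = rng.normal(size=width)
    W2 = rng.normal(size=width)
    b2 = rng.normal()
    logits = np.tanh(X @ W1 + b1) @ W2 + b2
    return "".join("1" if z > 0 else "0" for z in logits)

counts = Counter(sample_function() for _ in range(50_000))
print("distinct functions hit:", len(counts), "of", 2 ** 8, "possible")
print("most common truth tables:", counts.most_common(5))
# The distribution is heavily skewed: a few simple truth tables (typically constant
# or near-constant) soak up a disproportionate share of the probability mass.
# That skew is the implicit prior being discussed above.
```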
I should also note that the fact that Gaussian processes work at all already gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any highly expressive function approximator, the “null hypothesis” should be that it basically won’t generalise at all. The fact that NNs and GPs both work, together with the fairly strong correspondence between them, means it would be kind of weird if they worked for completely different reasons.
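For anyone who wants to see the correspondence concretely: in the infinite-width limit, a single-hidden-layer ReLU network with Gaussian weights corresponds to a GP whose covariance is the order-1 arc-cosine kernel (Cho & Saul). Here is a minimal GP-regression sketch with that kernel, on made-up toy data, with unit weight variances and no bias terms, so the details differ from the paper’s setup:

```python
import numpy as np

def relu_nngp_kernel(X1, X2):
    """Order-1 arc-cosine kernel: the infinite-width limit of one hidden ReLU
    layer with unit-variance Gaussian weights (scaling constants dropped)."""
    n1 = np.linalg.norm(X1, axis=1)
    n2 = np.linalg.norm(X2, axis=1)
    cos = np.clip((X1 @ X2.T) / np.outer(n1, n2), -1.0, 1.0)
    theta = np.arccos(cos)
    return (np.outer(n1, n2) / np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-3):
    """Standard GP regression posterior mean under the kernel above."""
    K = relu_nngp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    return relu_nngp_kernel(X_test, X_train) @ np.linalg.solve(K, y_train)

# Toy data: 1D inputs lifted to 2D with a constant feature, so no input has zero norm.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 20)
X_train = np.stack([x, np.ones_like(x)], axis=1)
y_train = np.sin(x) + 0.05 * rng.normal(size=x.shape)
x_new = np.linspace(-2.0, 2.0, 5)
X_test = np.stack([x_new, np.ones_like(x_new)], axis=1)
print(np.round(gp_posterior_mean(X_train, y_train, X_test), 3))  # should roughly track sin(x_new)
```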
I’d be interested to hear your take on why this means NNs are only doing interpolation. What does it mean to only do interpolation and not extrapolation? I know the toy-model definitions of those terms (connecting dots vs. drawing a line off away from your dots), but what does it mean in real-life problems? It seems like a fuzzy/graded distinction to me, at best.
Also, if the simplicity prior they use is akin to Kolmogorov-complexity-based priors, then what they are doing is akin to what e.g. Solomonoff Induction does. And I’ve never heard anyone try to argue that Solomonoff Induction “merely interpolates” before!
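On that point, here is a quick toy illustration of what a simplicity prior of that flavour looks like (my own sketch, not the paper’s estimator: zlib-compressed length stands in for Kolmogorov complexity, which it only upper-bounds):

```python
import random
import zlib

def complexity_bits(truth_table: str) -> int:
    """Crude stand-in for Kolmogorov complexity: bit-length of the
    zlib-compressed truth table (an upper bound, up to constants)."""
    return 8 * len(zlib.compress(truth_table.encode()))

def simplicity_prior(functions):
    """Weight each function by 2^-K(f) and normalize -- the same shape as a
    Solomonoff/Levin-style prior, with the compressor standing in for K."""
    weights = {f: 2.0 ** -complexity_bits(f) for f in functions}
    total = sum(weights.values())
    return {f: w / total for f, w in weights.items()}

random.seed(0)
simple_f = "01" * 64                                        # very regular 128-bit truth table
messy_f = "".join(random.choice("01") for _ in range(128))  # incompressible-looking one

prior = simplicity_prior([simple_f, messy_f])
for f, p in prior.items():
    print(f[:16] + "...", round(p, 6))
# The regular truth table compresses far better, so it gets essentially all of
# the prior mass: simple functions dominate, as in Solomonoff induction.
```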