Two confusions about what this paper is claiming.
First Confusion: Why does the experimental test involve any test data at all?
It seems like P_B(f|S) and P_opt(f|S) are denoting different things in different places. Writing out the model from the first few sections explicitly:
We have a network Y = N(θ, X), where θ are the parameters and (X, Y) are the input and output
Any distribution P(θ) then gives a prior distribution on X→Y functions: P(f) = ∫ I[∀X: f(X)=N(θ,X)] P(θ) dθ, where I is an indicator function
Given the training data S, we can update in two ways: Bayesian update or SGD
Bayes update is: P_B(f|S) = (1/Z) I[∀(x,y)∈S: y=f(x)] P(f)
SGD update is: P_opt(f|S) = (1/Z) ∫ I[∀X: f(X)=N(SGD(S,θ),X)] P(θ) dθ, where SGD(S,θ) is the “optimal” θ-value spit out by SGD on data S when initialized at θ (both updates are written out as sampling procedures in the sketch below)
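Concretely, here is a minimal sketch (my own illustration, not code from the post or paper) of how one would sample a function from each of the two distributions. The helpers sample_prior_theta, network, and run_sgd are hypothetical placeholders for the initialization distribution, the forward pass, and a training loop that fits S exactly.

```python
import numpy as np

def bayes_sample(S, sample_prior_theta, network):
    """Rejection sampling from P_B(f|S): draw θ from the prior and keep it only
    if the resulting function fits every training point exactly."""
    while True:
        theta = sample_prior_theta()
        if all(np.array_equal(network(theta, x), y) for x, y in S):
            return theta  # f = network(theta, ·) is a sample from P_B(f|S)

def sgd_sample(S, sample_prior_theta, run_sgd):
    """Sampling from P_opt(f|S): draw a random initialization from the same prior,
    then run SGD on S; the trained parameters define the sampled function."""
    theta_0 = sample_prior_theta()
    theta_star = run_sgd(theta_0, S)  # the "optimal" θ-value SGD(S, θ_0)
    return theta_star                 # f = network(theta_star, ·) is a sample from P_opt(f|S)
```

Comparing how often these two procedures land on the same function (bucketed as discussed later in the thread) is what the claim P_B(f|S) ≈ P_opt(f|S) is about.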
Now for the key confusion: the claim that P_B(f|S) ≈ P_opt(f|S) sounds like it should apply to the definitions above, for all functions f (or at least with high probability), without any reference whatsoever to any test data.
So how does the test data enter? It sounds like the experiments evaluate P[Y=f(X)], for (X,Y) values from the test set, under the two f-distributions above. This is a very different notion of “P_B(f|S) ≈ P_opt(f|S)” from the one above, and it’s not obvious whether it’s relevant at all to understanding DNN performance. It’s asking “are the probabilities assigned to the test data approximately the same?” rather than “are the distributions of trained functions approximately the same?”.
My optimistic guess as to what’s going on here: the “test set” is completely ignoring the test labels, and instead choosing random labels. This would be equivalent to comparing P_B(f|S) to P_opt(f|S) at random f, but only on a fixed subset of 100 possible X-values (drawn from the true distribution). If that’s what’s going on, then it is asking “are the distributions of trained functions approximately the same on realistic X-values?”, and it’s just confusing to the reader to talk about these random functions as coming from a “test set”. Not a substantive problem, just a communication difficulty.
Second Confusion: How does the stated conclusion follow from the figures?
This confusion is more prosaic: I’m not convinced that the conclusion P_B(f|S) ≈ P_opt(f|S) follows from the figures, at least not to enough of an extent to be “strong evidence” against the claim that SGD is the main source of inductive bias.
Many of these figures show awfully large divergence from equality, and I’m not seeing any statistical measure of fit. Eyeballing them, it’s clear that there’s a strong enough relation to rule in inductive bias in the network architecture, but that does not rule out inductive bias in SGD as well. To make the latter claim, there would have to be statistically-small residual inductive bias after accounting for the contribution of bias from P_B(f|S), and I don’t see the paper actually making that case. I find the claim plausible a priori, but I don’t see the right analysis here to provide significant evidence for it.
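For what it’s worth, here is a sketch of the sort of statistical check I have in mind (my own suggestion, not an analysis from the paper): take per-function estimates of the two probabilities from the experiments (hypothetical arrays here), regress one on the other, and ask how much variance in the SGD frequencies is left unexplained once the Bayesian posterior is accounted for.

```python
import numpy as np
from scipy import stats

def residual_bias_check(log_p_bayes, log_p_sgd):
    """Regress log P_opt(f|S) on log P_B(f|S) across function buckets; the residual
    spread bounds how much inductive bias SGD could be adding on top of the
    parameter-function prior (plus sampling noise)."""
    slope, intercept, r, p_value, stderr = stats.linregress(log_p_bayes, log_p_sgd)
    residuals = np.asarray(log_p_sgd) - (slope * np.asarray(log_p_bayes) + intercept)
    return {
        "slope": slope,                   # ≈ 1 if the two distributions track each other
        "r_squared": r ** 2,              # variance in SGD frequencies explained by P_B alone
        "residual_std": residuals.std(),  # leftover spread attributable to SGD (or noise)
    }
```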
Chris (author of the linked post) messaged me about these, and I also looked at the full paper and thought about it some more. There are still some stray pieces, but I’m now generally convinced the headline claim is correct: the vast majority of inductive bias comes from the implicit prior on the network’s parameter space.
Shortest path I see to that conclusion, from the figures, is a Fermi estimate:
figure 5b (in the post; figure 1b in paper) shows that functions with Bayesian posterior above ~10^-10 have typical generalization error on the order of 5% (hard to get more granular than that)
figure 5a shows that SGD typically chooses functions with Bayesian posterior above ~10^-10 (remember the x-axis is a cutoff here), and functions with higher Bayesian posterior are generally chosen more often by SGD
So, if the Bayesian posterior fully accounts for generalization error and SGD is not contributing any further inductive bias at all, we’d expect to see generalization error on the order of 5% from SGD.
Looking at figure 5d, we do indeed see generalization error on the order of 5% from SGD samples (the y-axis is not relevant to the estimate here).
This isn’t precise enough to rule out small contributions of SGD to the inductive bias, but it is pretty strong evidence that the Bayesian posterior is the main factor.
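Spelled out with made-up but figure-shaped numbers (orders of magnitude only, not data from the paper), the estimate is just a posterior-weighted average: if SGD picks function buckets with weights close to the Bayesian posterior, its expected generalization error has to come out near the error of the high-posterior functions.

```python
# Hypothetical buckets of (Bayesian posterior mass, generalization error),
# shaped roughly like figure 5b: high-posterior functions sit around 4-6% error.
buckets = [
    (3e-2, 0.04),
    (1e-4, 0.06),
    (1e-10, 0.20),  # low-posterior functions: higher error, but almost never chosen
]
total_mass = sum(p for p, _ in buckets)
expected_error = sum(p * err for p, err in buckets) / total_mass
print(f"expected SGD error if SGD ≈ Bayesian sampler: {expected_error:.1%}")
# prints ~4%, i.e. on the order of the ~5% actually seen for SGD samples in figure 5d
```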
Other things I’m updating on, based on the post/paper:
This makes it very likely that Gaussian processes, as a theoretical model, capture most of what makes DNNs work in practice
This makes it very likely that DNNs are “only doing interpolation”, in some sense, as opposed to extrapolation. (This already seemed fairly likely based on scaling curves, and the gaussian process model gives us a second line of evidence.)
Yes, I agree with this.
I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It’s also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and batch size you use, etc. However, these effects seem to be quite small compared to the effect that comes from the parameter-function map.
I should also note that I think the fact that Gaussian processes work at all already gives us, in itself, a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if it is highly expressive, then the “null hypothesis” should be that it basically won’t generalise at all. The fact that NNs and GPs both work, and that there is a fairly strong correspondence between them, means that it would be kind of weird if they worked for completely different reasons.
I’d be interested to hear your take on why this means NNs are only doing interpolation. What does it mean to only do interpolation and not extrapolation? I know the toy-model definitions of those terms (connecting dots vs. drawing a line off away from your dots), but what does it mean in real-life problems? It seems like a fuzzy/graded distinction to me, at best.
Also, if the simplicity prior they use is akin to Kolmogorov complexity-based priors, then what they are doing is akin to what e.g. Solomonoff Induction does. And I’ve never heard anyone try to argue that Solomonoff Induction “merely interpolates” before!
I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I’ll make some clarifications here as well, for the benefit of others:
The key claim, that P_B(f|S) ≈ P_opt(f|S), is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granularise the space of possible functions if we want to compare these two quantities empirically. If we take “the space of functions” to be all the functions that a given neural network can express on the entire vector space on which it is defined, then there would be an uncountably infinite number of such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting of a bunch of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct: the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).
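To make the bucketing concrete, here is a minimal sketch of that procedure as I understand it (my paraphrase, not the paper’s code; network and x_test are hypothetical placeholders for the trained model’s forward pass and the held-out inputs):

```python
def function_bucket(theta, x_test, network):
    """Identify a trained network with the tuple of labels it assigns to a fixed
    set of held-out inputs; networks that agree on all of them count as the same
    function f. The test labels are never consulted."""
    return tuple(int(network(theta, x).argmax()) for x in x_test)
```

Counting how often each bucket shows up across many Bayesian samples and many SGD runs then gives the empirical estimates of P_B(f|S) and P_opt(f|S) that are compared in the figures.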