I should note that SGD definitely does contribute to the inductive bias, but its contribution seems to be quite small compared to the implicit prior embedded in the parameter-function map. For example, Figure 6 in the Towards Data Science post shows that different versions of SGD give slightly different inductive biases. It’s also well-known that the inductive bias of a neural network is affected by how long you train, the step size, the batch size, and so on. However, these effects seem quite small compared to the effect that comes from the parameter-function map.
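To make the "implicit prior of the parameter-function map" concrete, here is a minimal sketch (my own illustration, not taken from the post or the paper): sample random parameters for a small ReLU network on boolean inputs and count how often each input-output function appears. The heavy skew of that frequency distribution is the prior I'm referring to; the architecture and sample sizes below are arbitrary choices for illustration.

```python
import numpy as np

# Sample random parameters for a tiny ReLU network on all 2^7 boolean inputs
# and record how often each induced input->output function appears.
rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 7, 40, 20000

# All 2^7 boolean input vectors, mapped to {-1, +1}.
X = (((np.arange(2**n_in)[:, None] >> np.arange(n_in)) & 1) * 2 - 1).astype(float)

counts = {}
for _ in range(n_samples):
    W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_hidden))
    b1 = rng.normal(0, 1, n_hidden)
    w2 = rng.normal(0, 1 / np.sqrt(n_hidden), n_hidden)
    h = np.maximum(X @ W1 + b1, 0)        # ReLU hidden layer
    f = (h @ w2 > 0).astype(int)          # thresholded output: a boolean function
    key = f.tobytes()
    counts[key] = counts.get(key, 0) + 1

freqs = sorted(counts.values(), reverse=True)
print("distinct functions sampled:", len(freqs))
print("probability mass on the 10 most frequent:", sum(freqs[:10]) / n_samples)
```

If the prior were anywhere near uniform over the 2^128 possible functions, the mass on the top 10 would be negligible; the skew you actually see is the sense in which the parameter-function map already encodes a strong (simplicity-like) bias before any training happens.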
I should also note that the mere fact that Gaussian processes work at all already gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any highly expressive function approximator, the “null hypothesis” should be that it basically won’t generalise at all. The fact that both NNs and GPs do work, together with the fairly strong correspondence between them, means it would be kind of weird if they worked for completely different reasons.
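For concreteness, here is a rough sketch of the NN-GP correspondence I have in mind (my own toy example, relying on the standard infinite-width result rather than anything specific from the post): the kernel of an infinitely wide one-hidden-layer ReLU network with Gaussian weights is the arc-cosine kernel, and GP regression with that kernel shows what the network's prior predicts without any SGD at all. The data and noise level below are made up for illustration.

```python
import numpy as np

def arccos_kernel(X, Y):
    """E[relu(w.x) * relu(w.y)] for w ~ N(0, I): the infinite-width ReLU kernel."""
    nx = np.linalg.norm(X, axis=1, keepdims=True)       # (n, 1)
    ny = np.linalg.norm(Y, axis=1, keepdims=True)       # (m, 1)
    cos = np.clip((X @ Y.T) / (nx * ny.T), -1.0, 1.0)
    theta = np.arccos(cos)
    return (nx * ny.T) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos)

def add_bias(x):
    # Append a constant feature, playing the role of the hidden-layer bias.
    return np.hstack([x, np.ones_like(x)])

# Tiny 1-D regression problem.
rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, (20, 1))
y_train = np.sin(2 * x_train[:, 0]) + 0.1 * rng.normal(size=20)
x_test = np.linspace(-3, 3, 200)[:, None]
X_train, X_test = add_bias(x_train), add_bias(x_test)

K = arccos_kernel(X_train, X_train) + 1e-2 * np.eye(len(X_train))  # noise / jitter
K_star = arccos_kernel(X_test, X_train)
mean = K_star @ np.linalg.solve(K, y_train)                          # GP posterior mean
print(mean[:5])
```

The point of the sketch is just that this GP is a sensible predictor in its own right, which is why the correspondence with NNs carries evidential weight about where their generalisation comes from.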
I’d be interested to hear your take on why this means NNs are only doing interpolation. What does it mean to only do interpolation and not extrapolation? I know the toy model definitions of those terms (connecting dots vs. drawing a line off away from your dots), but what does it mean in real-life problems? It seems like a fuzzy/graded distinction to me, at best.
Also, if the simplicity prior they use is akin to Kolmogorov-complexity-based priors, then what they are doing is akin to what e.g. Solomonoff Induction does. And I’ve never heard anyone try to argue that Solomonoff Induction “merely interpolates” before!
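To illustrate the analogy (this is my own toy example, using zlib-compressed length as a very crude stand-in for Kolmogorov complexity; Solomonoff Induction itself sums over programs and is incomputable): a simplicity-weighted posterior over continuations of a bit string assigns its mass to whole futures of the data, which is exactly the kind of prediction people would normally call extrapolation.

```python
import zlib
import itertools

def complexity(bits: str) -> int:
    # Compressed length as a rough proxy for Kolmogorov complexity.
    return len(zlib.compress(bits.encode()))

observed = "01" * 16        # the data seen so far
horizon = 8                 # how many future bits to predict

# Weight each possible continuation by 2^(-complexity of data + continuation),
# in the spirit of a Solomonoff-style simplicity prior.
posterior = {}
for cont in itertools.product("01", repeat=horizon):
    full = observed + "".join(cont)
    posterior[full] = 2.0 ** (-complexity(full))

total = sum(posterior.values())
best = max(posterior, key=posterior.get)
print("highest-weight continuation under this crude prior:", best[len(observed):])
print("its share of the posterior mass:", posterior[best] / total)
```

zlib is a very blunt instrument, so the ranking is only suggestive, but the structure of the computation is the relevant part: nothing in it restricts predictions to lie "between" previously seen points.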
Yes, I agree with this.