But large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import).
Is it that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?
The evidence is the empirical performance of Gaussian processes being similar to that of neural nets on simple tasks.
Well, they do take many samples from what they call P_SGD and P_B and compare these as distributions, so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest they were just comparing max performance by SGD to max performance by NNGP.
Re breaking down on more difficult problems: yes, I agree, we will have to wait and see, and we shouldn’t be too optimistic given the paper you point to in your own post.
[from your post on NTK/GP not learning features] it seems possible that they [neural networks] could be doing something much more interesting—perhaps even implementing something like a simplicity prior over a large class of functions, which I’m pretty sure NTK/GP can’t be
It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity. I would be very interested in your take on other (non-Mingard) research on this overall question if you have time to jot down some notes.
Is it that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?
They don’t really try to show simple --> large volume. What they show is that there is substantial 'clustering', so some simple functions have large volume. I like nostalgebraist’s remarks on their clustering measures.
so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest that they were just comparing max performance by SGD to max performance by NNGP.
Fair point, they do compare the distributions as well. I don’t think it’s too surprising that they’re similar, since they compare them on test points from the distribution which they were trained to fit.
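To make the methodology concrete, here is a minimal sketch of this kind of comparison. Everything in it is invented for illustration (the toy task, the architecture, the sample sizes), and the papers estimate P_B with NNGP approximations rather than the brute-force rejection sampling below:

```python
# Toy reconstruction of the P_SGD vs P_B comparison discussed above.
# All details (task, architecture, sample sizes) are invented for
# illustration; the papers approximate P_B with NNGP methods rather
# than the rejection sampling used here.
import itertools
from collections import Counter

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Tiny Boolean task: 5-bit inputs, 8 training points, 24 held-out points.
X_all = np.array(list(itertools.product([0, 1], repeat=5)), dtype=float)
y_all = (X_all.sum(axis=1) >= 3).astype(int)  # a simple target function
train = np.concatenate([rng.choice(np.where(y_all == c)[0], size=4,
                                   replace=False) for c in (0, 1)])
test = np.setdiff1d(np.arange(len(X_all)), train)

def bayes_sample():
    """One draw from P_B: sample random weights and keep them only if the
    resulting net fits the training set exactly (rejection sampling)."""
    while True:
        W1, b1 = rng.normal(size=(5, 16)), rng.normal(size=16)
        W2, b2 = rng.normal(size=16), rng.normal()
        f = (np.tanh(X_all @ W1 + b1) @ W2 + b2 > 0).astype(int)
        if (f[train] == y_all[train]).all():
            return tuple(f[test])  # the function restricted to held-out inputs

def sgd_sample(seed):
    """One draw from P_SGD: train a small net with SGD from a random init."""
    clf = MLPClassifier(hidden_layer_sizes=(16,), solver="sgd",
                        max_iter=2000, random_state=seed)
    clf.fit(X_all[train], y_all[train])
    return tuple(clf.predict(X_all[test]))

# Compare the two empirical distributions over functions, not just accuracy.
P_B = Counter(bayes_sample() for _ in range(200))
P_SGD = Counter(sgd_sample(seed) for seed in range(200))
for f in sorted(set(P_B) | set(P_SGD), key=lambda f: -P_B[f]):
    print(f"P_B={P_B[f]/200:.3f}  P_SGD={P_SGD[f]/200:.3f}  f={f}")
```

The output is two empirical distributions over functions on the held-out inputs, so the comparison is between distributions, not between two accuracy numbers.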
It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity
I do, although I’m not sure I would say ‘architectural bias’, since I think SGD might play an important role. Unfortunately I don’t have much of substance to say about this: Mingard is the only researcher I’m aware of explicitly trying to link networks to simplicity priors. I think the most promising way to make progress here is likely to be analyzing neural nets in some non-kernel limit, like Greg Yang’s work or this paper.
I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of (see in particular the replies by Joar Skalse, the author of that post). You say that you don’t find it surprising that the posteriors of NNs are similar to those of NNGPs on the data they were trained to fit; I think this is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium-scale datasets and architectures).

In the main paper, https://jmlr.org/papers/v22/20-676.html, we review a substantial amount of the literature on this topic. Some results that rely on “different hyperparameters result in different generalisation” type arguments were later found to be due to different effective training times (see Hoffer et al., 2017). We also show that optimiser hyperparameter tuning can affect generalisation, although in a fashion similar to changing the temperature in fully tempered posteriors (see eqn 1 in https://openreview.net/pdf?id=cu6zDHCfhZx); in other words, the bias is still fundamentally due to the architecture.
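For reference, a fully tempered posterior (in the usual convention, where both the likelihood and the prior are raised to the power 1/T; this is the standard form, not a quotation of their eqn 1) is

$$p_T(\theta \mid D) \;\propto\; \big[\, p(D \mid \theta)\, p(\theta) \,\big]^{1/T}.$$

At T = 1 this recovers the ordinary Bayesian posterior; changing T sharpens or flattens it, but does not change which parameter-function map the bias comes from, which is the sense in which it remains fundamentally architectural.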
Beyond the pretty conclusive evidence that SGD is a much smaller source of inductive bias than the architecture on small/medium-scale tasks, I think there is a lot of evidence elsewhere that the architecture is responsible for the first-order generalisation capabilities of the network. For example, https://arxiv.org/abs/2012.04115 shows that architecture-only bounds are excellent predictors of performance on SOTA networks (e.g. wide ResNets), as does https://arxiv.org/pdf/2002.02561.pdf (from a different group). As more circumstantial evidence, CNNs typically outperform fully connected nets for image classification, and transformers outperform LSTMs for sentiment analysis, etc., even though the same type of optimisers are used.
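Schematically, these architecture-only bounds have a PAC-Bayes flavour: the generalisation error of a function f that fits m training examples is controlled by its prior probability under the architecture alone, roughly

$$\epsilon(f) \;\lesssim\; \frac{\ln\frac{1}{P(f)} + \ln\frac{2m}{\delta}}{m},$$

where P(f) is the probability that randomly sampled parameters implement f (constants and the exact form are elided here; see the linked papers for the precise statements). An architecture that puts large prior volume on the data-fitting functions therefore gets a strong guarantee with no reference to the optimiser.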
I think there are very interesting questions remaining about the role of the optimiser in narrow networks, feature learning, and very large-scale models. Clearly, though, the methods we used on the small/medium-scale architectures and datasets will not scale to these questions without some major changes. In the meantime, we are using current methods to investigate some edge cases, none of which have so far shown strong deviation from our predictions.
I would suggest that the architecture being the main source of inductive bias might be a sensible null hypothesis for the cases we have yet to directly probe. I also find the comparative simplicity of the hypothesis quite compelling: SGD finds functions with probability proportional to their volume in parameter space, i.e. it performs random sampling (very closely when there is a strong bias in the parameter-function map, and progressively less closely the weaker that bias gets), and a strong architectural bias towards simplicity (again with some subtleties) causes the good generalisation.
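Stated slightly more formally (this is a paraphrase, with M the parameter-function map, p(θ) the distribution over parameters, and D the training set):

$$P_{\mathrm{SGD}}(f \mid D) \;\approx\; P_B(f \mid D) \;=\; \frac{P(f)}{\sum_{f' \text{ consistent with } D} P(f')}, \qquad P(f) \;=\; \mathbb{P}_{\theta \sim p(\theta)}\big[M(\theta) = f\big],$$

where f ranges over functions consistent with D. The architectural bias towards simplicity is then the claim that P(f) is heavily concentrated on simple functions.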
I think we basically agree on the state of the empirical evidence—the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they’ll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.
re: the architecture being the source of inductive bias—I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.
Thanks for these pointers.