I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of (see in particular the replies by Joar Skalse, the author of that post). You say that you don’t find it surprising that the posteriors of NNs are similar to NNGPs on the data they were trained to fit. I think this is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium scale datasets and architectures). In the main paper https://jmlr.org/papers/v22/20-676.html we review a substantial amount of literature on this topic. Some results that rely on “different hyperparameters result in different generalisation” type arguments were later found to be due to different effective training times (see Hoffer et al. 2017). We also show that optimiser hyperparameter tuning can affect generalisation, although in a fashion similar to changing the temperature in fully tempered posteriors (see eqn 1 in https://openreview.net/pdf?id=cu6zDHCfhZx). In other words, the effect is still fundamentally due to the architecture.
Beyond the pretty conclusive evidence that SGD is a much smaller source of inductive bias than the architecture on small/medium scale tasks, I think there is a lot of evidence elsewhere that the architecture is responsible for the first-order generalisation capabilities of the network. For example, https://arxiv.org/abs/2012.04115 shows that architecture-only bounds are excellent predictors of performance on SOTA networks (e.g. wide ResNets), as does https://arxiv.org/pdf/2002.02561.pdf (from a different group). As more circumstantial evidence, it is known that CNNs typically outperform fully connected nets for image classification, and transformers outperform LSTMs for sentiment analysis etc., even though the same types of optimiser are used.
I think there are very interesting questions remaining about the role of the optimiser in narrow networks, feature learning and very large scale models. Clearly though, the methods we used on the small/medium scale architectures and datasets will not scale to these questions without some major changes. In the meantime, we are using current methods to investigate some edge cases, none of which has yet shown a strong deviation from our predictions.
I would suggest that the architecture being the main source of inductive bias might be a sensible null hypothesis for the cases we have yet to probe directly. I also find the comparative simplicity of the hypothesis quite compelling: SGD finds functions with probability proportional to their volume in parameter space, i.e. it behaves like random sampling (very closely when there is strong bias in the parameter-function map, and progressively less closely as that bias weakens), and a strong architectural bias towards simplicity (again with some subtleties) then causes the good generalisation.
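To make the "volume in parameter space" picture concrete, here is a toy sketch. It is purely illustrative and is not the code or setup from the paper: the tiny tanh network, the unit-variance Gaussian initialisation, and the names `sample_function`, `estimate_prior` and `posterior` are all assumptions made for this example. It samples random parameters, records which Boolean function each draw implements, and uses the resulting frequencies as an estimate of each function's volume. Conditioning on a training set is then just rejection: keep the functions that fit the data and renormalise.

```python
import math
import random
from collections import Counter
from itertools import product


def sample_function(inputs, hidden, rng):
    """Draw Gaussian weights for a one-hidden-layer tanh network (an assumed
    toy architecture) and return the Boolean function it computes, encoded as
    a tuple of 0/1 outputs, one per entry of `inputs`."""
    n_in = len(inputs[0])
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b2 = rng.gauss(0.0, 1.0)
    outputs = []
    for x in inputs:
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        y = sum(w * hi for w, hi in zip(w2, h)) + b2
        outputs.append(1 if y > 0 else 0)
    return tuple(outputs)


def estimate_prior(n_bits=3, hidden=8, n_samples=20000, seed=0):
    """Monte Carlo estimate of how often each function appears under random
    parameter draws, i.e. a proxy for its volume in parameter space."""
    rng = random.Random(seed)
    inputs = list(product([0, 1], repeat=n_bits))
    counts = Counter(sample_function(inputs, hidden, rng)
                     for _ in range(n_samples))
    return inputs, counts


def posterior(counts, train):
    """Rejection-sampling posterior: keep only sampled functions that agree
    with `train` (a dict mapping input index -> label) and renormalise.
    Returns an empty dict if no sampled function fits the data."""
    consistent = {f: c for f, c in counts.items()
                  if all(f[i] == y for i, y in train.items())}
    total = sum(consistent.values())
    if total == 0:
        return {}
    return {f: c / total for f, c in consistent.items()}


if __name__ == "__main__":
    inputs, counts = estimate_prior()
    n_samples = sum(counts.values())
    print("distinct functions seen:", len(counts), "of", 2 ** len(inputs))
    for f, c in counts.most_common(5):
        print(f, round(c / n_samples, 4))
    # Condition on two labelled inputs and inspect the most probable fits.
    post = posterior(counts, {0: 0, 7: 1})
    for f, p in sorted(post.items(), key=lambda kv: -kv[1])[:5]:
        print(f, round(p, 4))
```

The claim under discussion is that, on small/medium scale problems, the functions SGD converges to are distributed roughly like this rejection-sampled posterior. The sketch only shows what the random-sampling side of that comparison looks like; it does not train anything with SGD.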
I think we basically agree on the state of the empirical evidence—the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they’ll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.
re: the architecture being the source of inductive bias—I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.