Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian
Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?
There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:
Neural Networks Are Fundamentally Bayesian
This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:
https://arxiv.org/abs/2006.15191
https://arxiv.org/abs/1909.11522
https://arxiv.org/abs/1805.08522
(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)
I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.
If you want some more stuff to read in this genre, I can also recommend these two posts:
Recent Progress in the Theory of Neural Networks
Understanding “Deep Double Descent”
EDIT: Here is a second post, which talks more about the “prior” of neural networks:
Deep Neural Networks are biased, at initialisation, towards simple functions
The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.
That said, there’s a lot of “noise” in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there’s a lot of space spent on things which IMO just aren’t that interesting at all. So, I’ll summarize the things which I consider central.
When optimizing an overparameterized system, there are many different parameter settings which achieve optimality. Optima are not peaks, they’re ridges; there’s a whole surface on which optimal performance is achieved. In this regime, the key question is which of the many optima an optimized system actually converges to.
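To make the "ridge" picture concrete, here is a minimal sketch, assuming a hypothetical two-parameter model x -> a * b * x fitted to a single data point (x = 1, y = 2). The model, the data point, and the name `loss` are illustrative assumptions, not something taken from the linked papers. Maximizing f corresponds here to minimizing the loss, and every point on the curve a * b = 2 is an optimum.

```python
def loss(a, b, x=1.0, y=2.0):
    """Squared error of the toy model x -> a * b * x on one data point."""
    return (a * b * x - y) ** 2

# Two parameters, one constraint: the set of optima is a curve, not a point.
ridge = [(a, 2.0 / a) for a in (0.5, 1.0, 2.0, 4.0)]
ridge_losses = [loss(a, b) for a, b in ridge]

# A point off the ridge, for comparison.
off_ridge_loss = loss(1.0, 1.0)
```

All four parameter settings in `ridge` are equally good by the training objective, so the objective alone cannot say which one an optimizer will return.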
Here’s a kind-of-silly way to model it. First, we sample some random point in parameter space from the distribution P[θ]; in the neural network case, this is the parameter initialization. Then, we optimize: we find some new parameter values θ′ such that f(θ′) is maximized. But which of the many optimal θ′ values does our optimizer end up at? If we didn’t know anything about the details of the optimizer, one simple guess would be that θ′ is sampled from the initialization distribution, but updated on the point being optimal, i.e.
$$\theta' \sim P[\theta \mid f(\theta) \text{ is maximal}] = \frac{1}{Z}\, I[f(\theta) \text{ is maximal}]\, P[\theta]$$
… so the net effect of randomly initializing and then optimizing is equivalent to using the initialization distribution as a prior, doing a Bayesian update on θ′ being optimal, and then sampling from that posterior.
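As a rough illustration of what sampling from that posterior means, the sketch below uses rejection sampling: draw θ from the prior and keep it only if it is optimal. Everything specific here is an assumption made for the example: a threshold perceptron on two boolean inputs, a standard Gaussian initialization, a hypothetical two-point training set, and "optimal" meaning zero training error. It implements the Bayesian-update-on-optimality model directly; it does not run SGD and is not the experimental setup of the papers.

```python
import random
from collections import Counter

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
TRAIN = {(0, 0): 0, (1, 1): 1}  # hypothetical training set

def sample_prior(rng):
    """Draw theta = (w1, w2, b) from the assumed Gaussian initialization P[theta]."""
    return tuple(rng.gauss(0.0, 1.0) for _ in range(3))

def function_of(theta):
    """The boolean function computed by theta, as a tuple of outputs over INPUTS."""
    w1, w2, b = theta
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)

def is_optimal(theta):
    """The indicator I[f(theta) is maximal]: here, zero error on TRAIN."""
    f = dict(zip(INPUTS, function_of(theta)))
    return all(f[x] == y for x, y in TRAIN.items())

def sample_posterior(rng, n_accept):
    """Rejection sampling from P[theta | theta is optimal]."""
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_prior(rng)
        if is_optimal(theta):
            accepted.append(theta)
    return accepted

rng = random.Random(0)
posterior_samples = sample_posterior(rng, 2000)
posterior_over_functions = Counter(function_of(t) for t in posterior_samples)
```

Under this model, each function that fits the training data shows up in `posterior_over_functions` roughly in proportion to the prior mass of the parameters that implement it, which is the role of the 1/Z normalization.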
The linked papers show that this kind-of-silly model is basically accurate. It didn’t have to be this way a priori; we could imagine that the specifics of SGD favored some points over others, so that the distribution of θ′ was not proportional to the prior. But that mostly doesn’t happen (and to the extent it does, it’s a relatively weak effect); the data shows that θ′ values are sampled roughly in proportion to their density in the prior, exactly as we’d expect from the Bayesian-update-on-optimality model.
One implication of this is that the good generalization of neural nets must come mostly from the prior, not from some bias in SGD, because bias in SGD mostly just doesn’t impact the distribution of optimized parameter values. The optimized parameter value distribution is approximately determined by the initialization prior, so any generalization must come from that prior. And indeed, the papers also confirm that the observed generalization error lines up with what we’d expect from the Bayesian-update-on-optimality model.
For me, the most important update from this work has not been specific to neural nets. It’s about overparameterized optimization in general: we can think of overparameterized optimization as sampling from the initialization prior updated on optimality, i.e. P[θ|f(θ) is maximal]. This is a great approximation to work with analytically, and the papers here show that it is realistic for real complicated systems like SGD-trained neural nets.