Thanks for writing that; I look forward to reading it.
As for nomenclature, I did not define it; the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it: in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it answers the question "what is the probability of this output, given this input and given the data the posterior was conditioned on?". The Bayes training loss $T_n$ (which I haven't delved into in this post) is the empirical counterpart to the Bayes generalisation loss,
$$T_n = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i, D_n),$$
and so it averages the negative log-probability (the "surprisal") of the predictive distribution over the training datapoints $D_n$: if the predictive distribution is certain about every training datapoint, the training loss is 0. (Admittedly, this is an object that I am still getting my head around.)
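(To make this concrete, here is a minimal sketch of how $T_n$ could be estimated; the toy Gaussian model, the function names, and the posterior samples below are hypothetical illustrations of mine, not anything from the post. The idea is just that the predictive distribution is approximated by averaging the model likelihood over posterior samples, and $T_n$ averages its negative log over the training set.)

```python
import numpy as np

def predictive_density(y, x, posterior_ws, sigma=1.0):
    """Approximate p(y | x, D_n) by averaging the model likelihood
    p(y | x, w) over posterior samples w ~ p(w | D_n)."""
    means = posterior_ws * x  # toy model: y ~ Normal(w * x, sigma^2)
    dens = np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return dens.mean()

def bayes_training_loss(xs, ys, posterior_ws):
    """T_n = -(1/n) * sum_i log p(y_i | x_i, D_n)."""
    return -np.mean([np.log(predictive_density(y, x, posterior_ws))
                     for x, y in zip(xs, ys)])

# Hypothetical training data D_n and posterior samples (in practice the
# samples would come from e.g. MCMC on the posterior p(w | D_n)).
rng = np.random.default_rng(0)
xs = rng.normal(size=50)
ys = 2.0 * xs + rng.normal(scale=1.0, size=50)
posterior_ws = rng.normal(loc=2.0, scale=0.1, size=1000)

print(bayes_training_loss(xs, ys, posterior_ws))  # T_n for this toy setup
```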
The Bayes generalisation loss satisfies $G_n = \mathbb{E}_X[-\log p(Y \mid X, D_n)]$, where the expectation is over a fresh datapoint drawn from the true distribution: it is the same per-point loss as in $T_n$, but averaged over the whole space of inputs and outputs rather than over the training set. This is the sense in which it is reasonable to call it "generalisation". As I say in the post, there are other ways to think about generalisation in the Bayesian setting, like leave-one-out cross-validation, or other ways of extracting predictions from the posterior, like Gibbs sampling. Watanabe shows (see p. 236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
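(Continuing the toy sketch above: a Monte Carlo estimate of $G_n$ applies exactly the same per-point loss to fresh draws from the true distribution, which we only have access to here because this is a toy example.)

```python
def bayes_generalisation_loss(posterior_ws, n_test=10_000, seed=1):
    """Monte Carlo estimate of G_n: the same per-point loss as in T_n,
    but averaged over fresh (x, y) drawn from the true distribution
    rather than over the training set D_n."""
    rng = np.random.default_rng(seed)
    xs_new = rng.normal(size=n_test)
    ys_new = 2.0 * xs_new + rng.normal(scale=1.0, size=n_test)  # the true process
    return bayes_training_loss(xs_new, ys_new, posterior_ws)

print(bayes_generalisation_loss(posterior_ws))  # close to T_n in this well-specified toy
```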
As for the last point, neural networks do not "find" anything themselves; either an optimisation method like SGD does, or the Bayesian posterior "finds" regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does "find" low-RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of a neural network "finds" low-RLCT points.
Does this mean SGD does too? We don't know yet! And whether your intuition says this is a fruitful line of research to investigate is, I suppose, completely personal to your own mental model of the world.