It is an indisputable fact that Watanabe proves (as I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation
No. What was proven is that there are some points which can be represented by many possible configurations, more so than other points. There is no proof, or even evidence, that those points are reached by NN training via SGD, or that they represent good solutions to problems. As far as I can tell, you’re just assuming that because it seems to you like a logical reason for NN generalization.
With all due respect, I think you are misrepresenting what I am saying here. The sentence immediately after the one you quote is:
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of K(w), as measured by the RLCT λ. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is an interesting and important open problem, if it can be done at all.
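For concreteness, the rough shape of those results (my paraphrase of [Wat13] and [Wat18], realizable case, regularity hypotheses omitted), with $L_n$ the empirical negative log likelihood, $L$ its population counterpart, $w_0$ an optimal parameter, $\lambda$ the RLCT and $m$ its multiplicity, is

$$F_n = n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1),$$

$$\mathbb{E}[G_n] = L(w_0) + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),$$

so the singularity structure of the minima of K(w), through $\lambda$, controls the leading-order correction to the Bayes generalisation loss.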
the Bayes generalisation error Gn is the “derivative” of the free energy
I think calling that the “Bayes generalisation error” is where you went wrong. I see no good basis for saying it is “generalization” in the sense people normally mean the word.
I understand some things about a Free Energy Formula are proved, but I don’t think you’ve shown anything about low RLCT points tending to be the sort of useful solutions which neural networks find.
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it: the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting the predictive distribution is a reasonable object to study from the point of view of generalisation, because it answers the question: “what is the probability of this output, given this input, averaging over the posterior conditioned on the data?”. The Bayes training loss $T_n$ (which I haven’t delved into in this post) is the empirical counterpart of the Bayes generalisation loss,
$$T_n = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i, D_n),$$
and so it averages the negative log probability that the predictive distribution assigns to each training datapoint in $D_n$: if the predictive distribution is certain about every training datapoint, then the training loss will be 0. (Admittedly, this is an object that I am still getting my head around.)
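(As a toy illustration of these definitions, and nothing more: the sketch below, which is my own and not from the post, approximates the posterior of a one-parameter model on a grid, forms the predictive distribution, and evaluates $T_n$. The tanh model, the grid, and all names are assumptions made purely for illustration.)

```python
import numpy as np

# Toy sketch (illustration only): a one-parameter model
#   p(y | x, w) = N(y; tanh(w*x), 1)
# with a uniform prior on a grid of w values. We approximate the posterior on
# the grid, form the predictive distribution p(y | x, D_n), and evaluate the
# Bayes training loss  T_n = -(1/n) * sum_i log p(y_i | x_i, D_n).

rng = np.random.default_rng(0)

def log_model(y, x, w):
    """log N(y; tanh(w x), 1), rows indexed by grid values of w, columns by datapoints."""
    mu = np.tanh(np.outer(w, x))
    return -0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Training data D_n drawn from the model at a "true" parameter
w_true, n = 0.7, 50
x = rng.normal(size=n)
y = np.tanh(w_true * x) + rng.normal(size=n)

# Posterior on the grid: uniform prior, so posterior weights are proportional to the likelihood
w_grid = np.linspace(-3.0, 3.0, 1001)
log_lik = log_model(y, x, w_grid).sum(axis=1)
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

# Predictive distribution at the training points: p(y_i | x_i, D_n) = E_posterior[p(y_i | x_i, w)]
log_predictive = np.log(posterior @ np.exp(log_model(y, x, w_grid)))

# Bayes training loss: average negative log predictive over the training set
T_n = -log_predictive.mean()
print(f"T_n ≈ {T_n:.4f}")
```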
The Bayes generalisation loss is the same quantity with the empirical average over the training set replaced by an expectation over the true distribution of inputs and outputs, $G_n = \mathbb{E}_{(x,y)}\left[-\log p(y \mid x, D_n)\right]$. This is the sense in which it is reasonable to call it “generalisation”. As I say in the post, there are other ways you can think of generalisation in the Bayesian setting, like leave-one-out cross-validation, or other ways of extracting predictions from the posterior, like Gibbs sampling. Watanabe shows (see p. 236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
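(To connect this to the “derivative of the free energy” phrasing quoted earlier: writing $F_n = -\log \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\,\varphi(w)\,dw$ for the free energy, the predictive probability of a fresh datapoint is a ratio of marginal likelihoods, so

$$-\log p(y_{n+1} \mid x_{n+1}, D_n) = F_{n+1} - F_n,$$

and taking the expectation over the fresh datapoint gives $G_n = \mathbb{E}_{(x_{n+1}, y_{n+1})}[F_{n+1}] - F_n$, a discrete “derivative” of the free energy in $n$.)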
As for the last point, neural networks do not “find” anything themselves—either an optimisation method like SGD does, or the Bayesian posterior “finds” regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does “find” low RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of neural networks “finds” low RLCT points.
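(The sense in which the posterior “finds” such points can be sketched with the local free energy heuristic: for a neighbourhood $W_\alpha$ of a minimum with empirical loss $L_n(w_\alpha)$ and local RLCT $\lambda_\alpha$, roughly

$$P(w \in W_\alpha \mid D_n) \propto e^{-F_n(W_\alpha)}, \qquad F_n(W_\alpha) \approx n L_n(w_\alpha) + \lambda_\alpha \log n,$$

so among regions that are comparably accurate, posterior mass increasingly concentrates on those with the smallest $\lambda_\alpha$ as $n$ grows.)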
Does this mean SGD does? We don’t know yet! And whether your intuition says this is a fruitful line of research to investigate is completely personal to your own mental model of the world, I suppose.
I guess I’ll write a post.
I said I’d write a post, and I wrote a post.