Trained neural networks only generalize at all with certain types of activation functions. So if your theory doesn’t consider activation functions it’s probably wrong.
If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.
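For reference, the formula being invoked here, stated schematically (this paraphrase is mine; $L_n(w_0)$ is the empirical loss at an optimal parameter $w_0$, $\lambda$ is the RLCT, and lower-order terms are suppressed):

```latex
% Watanabe's asymptotic free energy formula (schematic form):
% the complexity correction grows like \lambda \log n,
% where \lambda is the RLCT.
F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n)
```

For a regular model $\lambda = d/2$, half the parameter count; the RLCT taking the place of $d/2$ is exactly where the geometry of the singularities of $K(w)$ enters.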
In DSLT3 I will outline the ways simple two-layer feedforward ReLU neural networks are singular models (i.e. I will show the symmetries in parameter space that produce the same input-output function), a result which generalises to deeper feedforward ReLU networks. There I will also discuss similar results for tanh networks, pointing to the fact that many (but not all) activation functions produce these symmetries. Neural networks with those activation functions are therefore singular models, which means the content and interpretation of Watanabe’s free energy formula applies to them.
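To make the symmetry claim concrete ahead of DSLT3, here is a minimal numerical sketch (my own toy example, not code from the sequence) of two such symmetries in a two-layer ReLU network: permuting hidden units, and the ReLU-specific positive rescaling of a unit’s incoming and outgoing weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W, b, v, c):
    # Two-layer feedforward ReLU network: f(x) = v^T relu(W x + b) + c
    return relu(x @ W.T + b) @ v + c

d, h = 3, 4                      # input dim, hidden width
W = rng.normal(size=(h, d))
b = rng.normal(size=h)
v = rng.normal(size=h)
c = 0.5
x = rng.normal(size=(10, d))     # a batch of test inputs

base = net(x, W, b, v, c)

# Symmetry 1: permuting the hidden units leaves the function unchanged.
perm = rng.permutation(h)
out_perm = net(x, W[perm], b[perm], v[perm], c)

# Symmetry 2 (ReLU-specific): scale one unit's incoming weights by s > 0
# and its outgoing weight by 1/s; relu(s*z) = s*relu(z) makes this exact.
s = 2.7
W2, b2, v2 = W.copy(), b.copy(), v.copy()
W2[0] *= s; b2[0] *= s; v2[0] /= s
out_scale = net(x, W2, b2, v2, c)

assert np.allclose(base, out_perm)
assert np.allclose(base, out_scale)
```

These continuous rescaling directions are one reason the set of true parameters is not a discrete collection of points, which is the situation the singular theory is built for.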
This is all pretty complicated compared to my understanding of why neural networks generalize, and I’m not sure why I should prefer it. Does this complex and detailed theory have any concrete predictions about NN design or performance in different circumstances? Can you accurately predict which activation functions work well?
My view is that this “singularity” of networks—which I don’t think is a good term, since it’s already overloaded with far too many meanings—is applicable to convergence properties but not to generalization ability.
What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena.
The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). Comparatively little work has been done on this field within AI research, and these sorts of facts are where SLT might take us (modulo AI capabilities concerns). The point is to demonstrate the key insights of the theory and signpost the fact that “hey, there might be something very meaningful here.” What we can predict with the theory is why certain phase transitions happen, in particular for the two-layer feedforward ReLU nets I will show in DSLT4. This is a seed from which to generalise to deeper nets and more intricate architectures, which is the natural way of doing good mathematics.
As to the “singularity” problem, you will have to take that up with the algebraic geometers who have been studying singularities for over 50 years. The fact is that optimal parameters are singularities of K(w) in non-trivial neural networks—hence, singular learning theory.
What do “convergence properties” and “generalisation ability” mean to you, precisely? It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation. As I say, its relation to SGD dynamics is certainly an open question. But in the Bayesian setting, the case is really quite closed.
It is an indisputable fact that Watanabe proves (and I elaborate on in the post) that this singularity structure plays a central role in Bayesian generalisation
No. What was proven is that there are some points which can be represented by lots of possible configurations, more so than other points. There is no proof, or even evidence, that those points are reached by NN training via SGD, or that they represent good solutions to problems. As far as I can tell, you’re just assuming that because it seems to you like a logical reason for NN generalization.

I guess I’ll write a post.
With all due respect, I think you are misrepresenting what I am saying here. The sentence immediately after the one you quote is
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of K(w), as measured by the RLCT λ. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is an interesting and important open problem, if it can be done at all.
I said I’d write a post, and I wrote a post.

the Bayes generalisation error Gn is the “derivative” of the free energy
I think calling that “Bayes generalisation error” is where you went wrong. I see no good basis for saying that’s true in the sense people normally mean “generalization”.
I understand some things about a Free Energy Formula are proved, but I don’t think you’ve shown anything about low RLCT points tending to be the sort of useful solutions which neural networks find.
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it—the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: “what is the probability of this output given this input and given the data of the posterior”. The Bayes training loss Tn (which I haven’t delved into in this post) is the empirical counterpart to the Bayes generalisation loss,
$$T_n = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i, D_n)$$
and so it adds up the “entropy” of the predictive distribution over the training datapoints Dn: if the predictive distribution is certain about all training datapoints, then the training loss will be 0. (Admittedly, this is an object that I am still getting my head around.)
The Bayes generalisation loss satisfies $G_n = \mathbb{E}_X[T_n]$, and therefore it averages this training loss over the whole space of inputs and outputs. This is the sense in which it is reasonable to call it “generalisation”. As I say in the post, there are other ways you can think of generalisation in the Bayesian setting, like leave-one-out cross-validation, or other methods of extracting data from the posterior like Gibbs sampling. Watanabe shows (see p. 236 of [Wat18]) that the RLCT is a central object in all of these alternative conceptions.
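As a toy illustration of these two objects (my own example, not from the post: a Beta-Bernoulli model with no inputs x, chosen only because the posterior predictive distribution has a closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 1.0, 1.0                     # Beta(1,1) prior
y = rng.binomial(1, 0.7, size=50)   # training data D_n, true q(y=1) = 0.7
n = len(y)

# Posterior predictive for Beta-Bernoulli: P(y=1 | D_n) = (a + sum y) / (a + b + n)
p1 = (a + y.sum()) / (a + b + n)

def log_pred(yi):
    # log of the predictive probability the model assigns to outcome yi
    return np.log(p1 if yi == 1 else 1.0 - p1)

# Bayes training loss: average negative log-predictive over the training data
T_n = -np.mean([log_pred(yi) for yi in y])

# Bayes generalisation loss: the same negative log-predictive averaged
# over the true distribution rather than the sample
G_n = -(0.7 * np.log(p1) + 0.3 * np.log(1.0 - p1))

print(round(T_n, 4), round(G_n, 4))
```

Both quantities are built from the same predictive distribution; the only difference is whether the average runs over the training sample or the true data-generating distribution, which is the sense in which T_n is the empirical counterpart of G_n.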
As for the last point, neural networks do not “find” anything themselves—either an optimisation method like SGD does, or the Bayesian posterior “finds” regions of high posterior concentration (i.e. it contains this information). SLT tells us that the Bayesian posterior of singular models does “find” low RLCT points (as long as they are sufficiently accurate). Neural networks are singular models (as I will explain in DSLT3), so the posterior of neural networks “finds” low RLCT points.
Does this mean SGD does? We don’t know yet! And whether your intuition says this is a fruitful line of research to investigate is completely personal to your own mental model of the world, I suppose.