A comment by Patrick Foré, a professor at the University of Amsterdam
I'm a bit puzzled by this post. The RLCT is a function of the triple (q, p, m): the sampling distribution q, the prior distribution p, and the parameter-to-distribution map m of the statistical model. So it actually does take the parameter-to-distribution map m into account. However, the post criticises that: "3. SLT abstracts away from both the prior and the parameter-function map."

Also, in the chapter "Why the SLT Answer Fails" a rather complex parameterization of a model is constructed, where SLT then supposedly fails, by pointing to a rather complex (as opposed to a simple) model. But this happens, of course, precisely because SLT takes the (here complicated) parameter-to-distribution map m into account. So it is unclear whether the criticism is that SLT does take m into account (but in a complicated way) or that it doesn't …

It was also said: "4. Hence, SLT is at its core unable to explain generalisation behaviour." and "SLT does not explain generalisation in neural networks."

It was shown in Watanabe's grey book (there still under somewhat more restrictive assumptions), see Remark 6.7 (3), that the Bayes generalization error of Bayesian learning is asymptotically, as the sample size n tends to infinity, given by

E[B_g] := E[ KL( q(x) || p(x | D_n) ) ] ~ RLCT / n

(restated as a typeset display at the end of this comment). So SLT actually does say something about generalization in the Bayesian learning setting, and it is a very satisfying answer imho (similar to what the VC dimension says about binary classification, except that the RLCT is defined much more generally, does not depend only on the function class but on the whole triple (q, p, m), and says something about the average case rather than just the worst case).

Of course, usually people don't do proper Bayesian deep learning (they usually do MLE/MAP estimation with SGD); they also plot a different type of generalization error and are interested in different aspects (e.g. finite-sample generalization, double-descent behaviour, etc.). But this gap could be mentioned at the very beginning of the post (maybe even in a table: 'what we want' vs. 'what SLT currently says'), and then it would be less surprising that SLT answers a different question than the one (most/some) people are interested in.

Certainly, what is written under "The Actual Solution" is closer to how deep learning is done in practice. However, it is also an investigation into learning theory for singular models (just not one focused on the RLCT), so it can also be considered part of SLT. Furthermore, nothing prevents us from investigating if and how it relates to quantities like the RLCT, the singular fluctuation, etc. (e.g. whether it provides upper or lower bounds on such quantities).

Maybe the title of the post, "My Criticism of Singular Learning Theory", should be replaced by "The deep learning community is interested in something other than what Singular Learning Theory currently provides".
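For reference, here is the asymptotic cited above written out as a display; this is just a restatement in standard notation, with λ denoting the RLCT of the triple (q, p, m) and D_n a sample of size n:

```latex
% Bayes generalization error asymptotic (as cited above from Watanabe's grey book,
% Remark 6.7 (3)); \lambda denotes the RLCT of the triple (q, p, m).
\mathbb{E}\bigl[\, B_g \,\bigr]
  := \mathbb{E}\Bigl[\, \mathrm{KL}\bigl( q(x) \,\big\|\, p(x \mid D_n) \bigr) \Bigr]
  \sim \frac{\lambda}{n}
  \qquad (n \to \infty).
```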