I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam’s Razor, but I think this obscures the main point. Regular linear regression could also obey Occam’s razor (i.e. “simpler” models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.
At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with “we can now understand why singular models have the capacity to generalise well”, instead of caveating the whole topic out of existence before the reader knows what is going on.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don’t know enough about Kolmogorov complexity to say much more than that.
I could have changed it to Why Neural Networks can obey Occam’s Razor, but I think this obscures the main point.
I think even this would be somewhat inaccurate. If a given parametric Bayesian learning machine does obey (some version of) Occam’s razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is primarily about is the relationship between the RLCT and posterior probability, and how this relationship can be used to reason about training dynamics. To connect this to Occam’s razor (or inductive bias more broadly), further assumptions and claims would be required.
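For reference, the relationship between the RLCT and posterior probability mentioned above is usually stated via the asymptotic expansion of the Bayesian free energy. The following is only a sketch of that standard statement; the symbols $F_n$, $L_n$, $w_0$, $\varphi$, $m$ and $d$ are notation assumed here for illustration and do not appear elsewhere in this thread.

```latex
% Free energy (negative log marginal likelihood) for n samples,
% with prior \varphi(w) and model p(y \mid x, w):
F_n = -\log \int \varphi(w) \prod_{i=1}^{n} p(y_i \mid x_i, w) \, dw

% Asymptotic expansion, where L_n is the average negative log likelihood,
% w_0 is an optimal parameter, \lambda is the RLCT and m its multiplicity:
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)

% For a regular model with d parameters, \lambda = d/2 and m = 1,
% which recovers the usual BIC penalty:
F_n = n L_n(w_0) + \frac{d}{2} \log n + O_p(1)
```

Read this way, among regions of parameter space with equal loss, the posterior favours the one with lower $\lambda$ as $n$ grows. The expansion by itself does not say whether the functions found in low-$\lambda$ regions are simple in any other sense.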
At the time of writing, basically nobody knew anything about SLT
Yes, thank you so much for taking the time to write those posts! They were very helpful for me to learn the basics of SLT.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that.
I’m very glad to hear that! :)
My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice
Yes, I also believe that! The polynomial example is definitely pathological, and I do think that low λ is almost certainly correlated with simplicity in the case of neural networks. My point is more that the mathematics of SLT does not by itself explain generalisation, and that additional assumptions will definitely be needed to derive specific claims about the inductive bias of neural networks.