I could have changed it to "Why Neural Networks can obey Occam’s Razor", but I think this obscures the main point.
I think even this would be somewhat inaccurate. If a given parametric Bayesian learning machine does obey (some version of) Occam’s razor, then this must be because of facts about its prior and facts about its parameter-function map, and SLT does not say very much about either of these two things. What the post is primarily about is the relationship between the RLCT and the posterior probability, and how this relationship can be used to reason about training dynamics. To connect this to Occam’s razor (or to inductive bias more broadly), further assumptions and claims would be required.
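(For concreteness, the RLCT–posterior relationship I have in mind is roughly Watanabe’s asymptotic expansion of the Bayesian free energy; this is only a sketch, stated under the usual SLT regularity assumptions and with lower-order terms dropped:

$$F_n \;=\; -\log \int_W \exp\bigl(-n L_n(w)\bigr)\,\varphi(w)\,dw \;\approx\; n L_n(w_0) \;+\; \lambda \log n,$$

where $L_n$ is the empirical negative log-likelihood, $\varphi$ the prior, $w_0$ an optimal parameter, and $\lambda$ the RLCT. Between two regions of parameter space achieving the same loss, the one with smaller $\lambda$ has lower free energy, and its relative posterior mass grows like a power of $n$. That is the sense in which the RLCT governs posterior probability, but by itself it says nothing about which functions those regions represent.)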
At the time of writing, basically nobody knew anything about SLT.
Yes, thank you so much for taking the time to write those posts! They were very helpful for me to learn the basics of SLT.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that.
I’m very glad to hear that! :)
My inclination is that the polynomial example is actually quite pathological, and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice.
Yes, I also believe that! The polynomial example is definitely pathological, and I do think that low λ almost certainly is correlated with simplicity in the case of neural networks. My point is more that the mathematics of SLT does not explain generalisation, and that additional assumptions will definitely be needed to derive specific claims about the inductive bias of neural networks.