I’m a bit confused, or I think you need some additional caveats in the intro. I would have said that Bayesian statistics with typical models is well understood as a bias toward short description length, but that there are important caveats in the neural network case, at least conceptually.
(That said, minimum description length is off by a constant factor, but this doesn’t seem to be what you’re getting at.)
You say:
TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.
And
while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic
But later you say:
The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks.
Maybe I can clarify a few points here:

A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
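To make “nondegenerate” concrete, here is a minimal numerical sketch. The toy one-hidden-unit tanh network, the parameter point, and the input distribution are my own illustrative choices (not taken from the post): at a parameter where the hidden unit is switched off, the model is not identifiable and the Fisher information matrix picks up a zero eigenvalue.

```python
import numpy as np

# Toy model: y = f_w(x) + noise, with f_w(x) = a * tanh(b * x) and unit Gaussian noise.
# For this model the Fisher information matrix is I(w) = E_x[ grad_w f_w(x) grad_w f_w(x)^T ].

def grad_f(a, b, x):
    # Gradient of f_w(x) = a * tanh(b * x) with respect to w = (a, b).
    return np.array([np.tanh(b * x), a * x / np.cosh(b * x) ** 2])

def fisher(a, b, xs):
    # Monte Carlo estimate of I(w) over a sample of inputs xs.
    grads = np.stack([grad_f(a, b, x) for x in xs])
    return grads.T @ grads / len(xs)

xs = np.random.default_rng(0).normal(size=10_000)

# Generic parameter point: both eigenvalues are positive, so I(w) is nondegenerate here.
print(np.linalg.eigvalsh(fisher(1.0, 1.0, xs)))

# At b = 0 the output is identically zero for every value of a, so the model is not
# identifiable there and the Fisher information matrix has a zero eigenvalue.
print(np.linalg.eigvalsh(fisher(1.0, 0.0, xs)))
```

Real networks have many such directions (permutation and sign symmetries, units that can be switched off), which is one reason the regular-model picture does not simply carry over.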
Regular models are the ones for which there is a link between low description length and low free energy (i.e. the models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
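Schematically, in my own notation (this is the standard asymptotic expansion from singular learning theory, not a quotation from the post), the Bayesian free energy at sample size n expands as

$$F_n \;\approx\; n L_n(w_0) + \lambda \log n,$$

where L_n(w_0) is the loss of the best fit and λ is the learning coefficient. For regular models λ = d/2 with d the number of parameters, which recovers the BIC and matches the familiar two-part MDL accounting: roughly (d/2) log n nats to specify the parameters plus n L_n(w_0) for the data given the model. For singular models, neural networks included, λ can be much smaller than d/2, and that is the sense in which the true complexity measure is “more exotic” than description length.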
It’s not really accurate to describe regular models as “typical”, especially not on LW where we are generally speaking about neural networks when we think of machine learning.
It’s true that the example presented in this post is, potentially, not typical (it is neither a neural network nor a standard kind of statistical model), so it’s unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold “with some caveats” for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
I don’t claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than “something like description length”.