I think there is some misunderstanding of what SLT says here, and you are identifying two distinct notions of complexity as the same, when in fact they are not. In particular, you have a line
“The generalisation bound that SLT proves is a kind of Bayesian sleight of hand, which says that the learning machine will have a good expected generalisation relative to the Bayesian prior that is implicit in the learning machine itself.”
I think this is precisely what SLT is saying, and this is nontrivial! One can say that a photon will follow a locally fastest route through a medium, even if this is different from saying that it will always follow the “simplest” route. SLT arguments always work relative to a loss landscape, and their meaning should (ideally) be interpreted relative to that landscape. The resulting predictions are, nevertheless, nontrivial, and are sometimes confirmed; for example, we have some work on this with Nina Rimsky.
You point at a different notion of complexity, associated with the parameter-function map. This also seems interesting, but it is distinct from the complexity phenomena in SLT (at least from the more basic concepts like the RLCT), and it is not considered in the basic SLT paradigm. Saying that this is another interesting avenue of study, or a potentially useful measure of complexity, is valid, but it is a priori independent of criticism of SLT (and ideally, of course, the two points of view could be combined).
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning. For example it’s not clear in your example why f(x) = 0 is likely to be learned (unless you have weight regularization). Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
Also I wanted to point out that in some ways, your “actual solution” is very close to the definition of RLCT from SLT. The definition of the RLCT is how much entropy you have to pay (in your language, the change in negative log probability of a random sample) to gain an exponential improvement of loss precision; i.e., “bits of specification per bit of loss”. See e.g. this article.
The thing is, the “complexity of f” (your K(f)) is not a very meaningful concept from the point of view of a neural net’s learning (you can try to make sense of it by looking at something like the entropy of the weight-to-function mapping, but then it won’t interact that much with learning dynamics). I think if you follow your intuitions carefully, you’re likely to precisely end up arriving at something like the RLCT (or maybe a finite-order approximation of the RLCT, associated to the free energy).
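The volume-scaling picture behind the RLCT can be checked numerically. Here is a toy sketch of my own (not from the thread): for a one-dimensional loss L(w) = w^(2k) with a uniform prior on [-1, 1], the prior mass of the sublevel set {w : L(w) < eps} scales as eps^lambda with lambda = 1/(2k), so fitting the slope of log-volume against log(eps) recovers the "bits of specification per bit of loss" exponent.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=2_000_000)  # uniform "prior" over a 1-d weight

def lambda_hat(loss, eps_grid):
    # prior volume of the sublevel set {w : loss(w) < eps}, on a log scale
    log_vol = np.log([np.mean(loss < e) for e in eps_grid])
    # the slope against log(eps) estimates the volume-scaling exponent
    return np.polyfit(np.log(eps_grid), log_vol, 1)[0]

eps = np.logspace(-6, -2, 10)
print(lambda_hat(w**2, eps))  # close to 1/2: each bit of loss precision costs half a bit of specification
print(lambda_hat(w**4, eps))  # close to 1/4: the more degenerate minimum is "cheaper"
```

This is the sense in which x^4 has a lower RLCT than x^2: tightening the loss by a factor of two costs fewer bits of weight specification at the more degenerate minimum.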
I have some criticisms of how SLT is understood and communicated, but I don’t think that the ones you mention seem that important to me. In particular, my intuition is that for purposes of empirical measurement of SLT parameters, the large-sample limit of realistic networks is quite large enough to see approximate singularities in the learning landscape, and that the SGD-sampling distinction is much more important than many people realize (indeed, there is no way to explain why generalizable networks like modular addition still sometimes memorize without understanding that the two are very distinct).
My main update in this field is that people should be more guided by empiricism and experiments, and less by competing paradigms of learning, which tend to be oversimplified and to fail to account for messy behaviors of even very simple toy networks. I’ve been pleasantly surprised by SLT making the same update in recent months.
I think this is precisely what SLT is saying, and this is nontrivial!
It is certainly non-trivial, in the sense that it takes many lines to prove, but I don’t think it tells you very much about the actual behaviour of neural networks.
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.
One of my core points is, precisely, to deny this claim. Without assumptions about the parameter-function map, you cannot make inferences from the characteristics of the loss landscape to conclusions about generalisation behaviour, and understanding generalisation behaviour is crucial for understanding learning. (Unless you mean something like “convergence behaviour” when you say “in the context of learning”, in which case I agree, but then you would consider generalisation to be outside the scope of learning.)
For example it’s not clear in your example why f(x) = 0 is likely to be learned
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
There is a relationship between the two, and I claim that this relationship underlies the mechanism behind why neural networks work well compared to other learning machines.
The thing is, the “complexity of f” (your K(f)) is not a very meaningful concept from the point of view of a neural net’s learning
If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
indeed, there is no way to explain why generalizable networks like modular addition still sometimes memorize without understanding that the two are very distinct
Why not? The memorising solution has some nonzero “posterior” weight, so you would expect it to be found with some frequency. Does the empirical frequency of this solution deviate far from the theoretical prediction?
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
How am I supposed to read this?
What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.
I don’t know how to read “x^4 has lower RLCT than x^2 despite x^2 being k-simpler” as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.
A few things:
1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all the assumptions made by SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they satisfy the assumptions of SLT must in fact be showing something else.
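The type-error point can be made concrete with a toy of my own (the parameterisations here are assumed for illustration): the zero function f(x) = 0 is realised at w = 0 both by the model f_w(x) = w*x and by f_w(x) = w^3*x, but the population losses near that point scale as w^2 and w^6 respectively, so the same function gets a different RLCT in each architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=2_000_000)  # uniform prior over the weight
eps = np.logspace(-6, -2, 10)

def volume_exponent(loss):
    # slope of log prior-volume of {w : loss(w) < eps} against log(eps)
    log_vol = np.log([np.mean(loss < e) for e in eps])
    return np.polyfit(np.log(eps), log_vol, 1)[0]

# Same function f(x) = 0 at w = 0, two parameterisations:
#   f_w(x) = w * x    -> population loss ~ w^2 -> exponent near 1/2
#   f_w(x) = w**3 * x -> population loss ~ w^6 -> exponent near 1/6
print(volume_exponent(w**2))
print(volume_exponent(w**6))
```

The exponent is a property of the loss landscape induced by the parameterisation, not of the function itself.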
My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstracts away from factors which are necessary to understand generalisation and inductive bias.
I don’t understand the strong link between Kolmogorov complexity and generalisation you’re suggesting here. I think by “generalisation” you must mean something more than “low test error”. Do you mean something like “out of distribution” generalisation (whatever that means)?
Kolmogorov complexity is definitely a misleading path here, and it’s unfortunate that Joar chose it as the “leading” example of complexity in the post. Note this passage:
However, they do not give a detailed answer to the question of precisely which complexity measure they minimise—they merely show that this result holds for many different complexity measures. For example, I would expect that fully connected neural networks are biased towards functions with low Boolean circuit complexity, or something very close to that. Verifying this claim, and deriving similar results about other kinds of network architectures, would make it easier to reason about what kinds of functions we should expect a neural network to be likely or unlikely to learn. This would also make it easier to reason about out-of-distribution generalisation, etc.
This quote from the above comment is better:
If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
I’ve expressed this idea with some links here:
Then if we combine two claims:
Joar’s “DNNs are (kind of) Bayesian” (for the reasons that I don’t understand, because I didn’t read their papers, so I just take his word here), and
Fields et al.’s “brains are ‘almost’ Bayesian, because Bayesian learning is information-efficient (= energy-efficient), and there is a strong evolutionary pressure for brains in animals to be energy-efficient”,
is this an explanation of DNNs’ remarkable generalisation ability? Or should more quantification be added to both of these claims to turn this into a good explanation?
Thank you for this—I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between “good priors” and “a prior towards low Kolmogorov complexity”, but this does skim past a lot of nuance. I do also very much not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
Yes, I mostly just mean “low test error”. I’m assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
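This assumption can be given a toy illustration (my own sketch, not from the post): if data-generating functions are produced by random small boolean formulas, with smaller formulas exponentially more likely, the induced distribution over functions is heavily skewed toward a few simple ones, which is the qualitative shape a Solomonoff-style prior asserts.

```python
import random
from collections import Counter

random.seed(0)
INPUTS = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def rand_expr(depth=4):
    # random boolean formula over 3 variables; smaller formulas are
    # exponentially more likely, a crude stand-in for a 2^-K prior
    if depth == 0 or random.random() < 0.3:
        return ('var', random.randrange(3))
    op = random.choice(('not', 'and', 'or'))
    if op == 'not':
        return ('not', rand_expr(depth - 1))
    return (op, rand_expr(depth - 1), rand_expr(depth - 1))

def evaluate(e, x):
    if e[0] == 'var':
        return x[e[1]]
    if e[0] == 'not':
        return 1 - evaluate(e[1], x)
    a, b = evaluate(e[1], x), evaluate(e[2], x)
    return a & b if e[0] == 'and' else a | b

def truth_table(e):
    # a function on 3 bits is one of 2^8 = 256 truth tables
    return tuple(evaluate(e, x) for x in INPUTS)

N = 50_000
counts = Counter(truth_table(rand_expr()) for _ in range(N))
print("distinct functions hit:", len(counts), "of 256")
print("mass on 10 most common:", sum(n for _, n in counts.most_common(10)) / N)
```

The mass concentrates on a handful of simple functions (projections, negations, simple conjunctions), while most of the 256 possible truth tables are rarely or never produced.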