I think this is precisely what SLT is saying, and this is nontrivial!
It is certainly non-trivial, in the sense that it takes many lines to prove, but I don’t think it tells you very much about the actual behaviour of neural networks.
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.
One of my core points is, precisely, to deny this claim. Without assumptions about the parameter function map, you cannot make inferences from the characteristics of the loss landscape to conclusions about generalisation behaviour, and understanding generalisation behaviour is crucial for understanding learning. (Unless you mean something like “convergence behaviour” when you say “in the context of learning”, in which case I agree, but then you would consider generalisation to be outside the scope of learning.)
For example it’s not clear in your example why f(x) = 0 is likely to be learned
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
There is a relationship between the two, and I claim that this relationship underlies the mechanism behind why neural networks work well compared to other learning machines.
The thing is, the “complexity of f” (your K(f)) is not a very meaningful concept from the point of view of a neural net’s learning
If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
indeed, there is no way to explain why generalizable networks like modular addition still sometimes memorize without understanding that the two are very distinct
Why not? The memorising solution has some nonzero “posterior” weight, so you would expect it to be found with some frequency. Does the empirical frequency of this solution deviate far from the theoretical prediction?
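For concreteness, a rough formalisation of that "posterior weight" (the notation here is illustrative, not taken from the post): it is just the Bayesian mass of the region of parameter space implementing the memorising solution.

```latex
% Posterior mass of the set W_mem of parameters implementing the memorising solution,
% given a dataset D_n of n samples, empirical loss L_n, and parameter prior \varphi:
\[
  P(\text{memorise} \mid D_n)
  \;=\;
  \frac{\int_{W_{\mathrm{mem}}} e^{-n L_n(w)}\,\varphi(w)\,dw}
       {\int_{W} e^{-n L_n(w)}\,\varphi(w)\,dw}.
\]
% If this quantity is non-negligible, an (approximately) Bayesian learner should land
% on the memorising solution with roughly that frequency across runs.
```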
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
How am I supposed to read this?
What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.
I don’t know how to read “x^4 has lower RLCT than x^2 despite x^2 being k-simpler” as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.
A few things:

1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).

2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine (see the worked example below).

3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all the assumptions made by SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they are subject to the assumptions of SLT must in fact be showing something else.
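To make point 2 concrete, here is the standard one-dimensional calculation (my own illustration of the textbook SLT result, not something from the original post):

```latex
% For a loss that behaves like K(w) = w^{2k} around its minimum at w = 0, the RLCT
% \lambda is minus the largest pole of the zeta function \zeta(z) = \int K(w)^z \varphi(w)\,dw:
\[
  \zeta(z) \;=\; \int |w|^{2kz}\,\varphi(w)\,dw
  \qquad\Longrightarrow\qquad
  \lambda \;=\; \frac{1}{2k}.
\]
% So a parametrisation whose loss looks like w^2 near the optimum has \lambda = 1/2,
% while one whose loss looks like w^4 has \lambda = 1/4, even if both optima realise
% exactly the same input-output function. The RLCT is a property of the parametrised
% model (together with the truth and the prior), not of the function in isolation.
```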
My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstracts away from factors which are necessary to understand generalisation and inductive bias.
I don’t understand the strong link between Kolmogorov complexity and generalisation you’re suggesting here. I think by “generalisation” you must mean something more than “low test error”. Do you mean something like “out of distribution” generalisation (whatever that means)?
Kolmogorov complexity is definitely a misleading path here, and it’s unfortunate that Joar chose it as the “leading” example of complexity in the post. Note this passage:
However, they do not give a detailed answer to the question of precisely which complexity measure they minimise—they merely show that this result holds for many different complexity measures. For example, I would expect that fully connected neural networks are biased towards functions with low Boolean circuit complexity, or something very close to that. Verifying this claim, and deriving similar results about other kinds of network architectures, would make it easier to reason about what kinds of functions we should expect a neural network to be likely or unlikely to learn. This would also make it easier to reason about out-of-distribution generalisation, etc.
This quote from the above comment is better:
If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
Then, if we combine two claims:

- Joar’s “DNNs are (kind of) Bayesian” (for reasons that I don’t understand, because I didn’t read their papers, so I just take his word here), and
- Fields et al.’s “brains are ‘almost’ Bayesian because Bayesian learning is information-efficient (= energy-efficient), and there is a strong evolutionary pressure for brains in animals to be energy-efficient”,

is this an explanation of DNNs’ remarkable generalisation ability? Or does more quantification need to be added to both of these claims to turn this into a good explanation?
Thank you for this—I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between “good priors” and “a prior towards low Kolmogorov complexity”, but this does skim past a lot of nuance. I do also very much not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
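For what it’s worth, the kind of experiment that probes this correlation looks roughly like the following. This is a minimal sketch in the spirit of the parameter-function-map experiments, with an illustrative architecture, a Gaussian prior over weights, and zlib compression as a crude stand-in for Kolmogorov complexity; none of these choices are taken from the papers.

```python
# Sketch: sample random weights for a tiny ReLU network on all Boolean inputs, record
# which Boolean function each sample implements, and check whether the most frequently
# sampled functions tend to be highly compressible (a crude complexity proxy).
import itertools
import zlib
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
n_inputs = 5
X = np.array(list(itertools.product([0.0, 1.0], repeat=n_inputs)))  # all 2^5 inputs


def random_function(hidden: int = 16) -> str:
    """Draw weights from a Gaussian prior and return the induced truth table."""
    W1 = rng.normal(size=(n_inputs, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    logits = np.maximum(X @ W1 + b1, 0.0) @ W2
    return "".join("1" if v > 0 else "0" for v in logits)


def complexity_proxy(truth_table: str) -> int:
    """Compressed length as a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(truth_table.encode()))


counts = Counter(random_function() for _ in range(100_000))
for table, freq in counts.most_common(5):
    print(f"freq={freq:6d}  proxy={complexity_proxy(table):3d}  {table}")
# If the parameter-function map is biased towards simple functions, the most frequent
# truth tables should be the highly compressible ones (e.g. near-constant functions).
```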
Yes, I mostly just mean “low test error”. I’m assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
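For reference, this assumption can be stated in terms of the universal (Solomonoff) prior over the outputs of a prefix universal machine U; the definitions below are standard, nothing specific to this thread.

```latex
% The universal prior weights each object by the total measure of programs producing
% it, which is dominated by the shortest such program:
\[
  m(f) \;=\; \sum_{p \,:\, U(p) = f} 2^{-|p|} \;\;\geq\;\; 2^{-K(f)},
\]
% so low-Kolmogorov-complexity functions carry most of the prior mass. The claim above
% is the empirical one that real-world data-generating processes are distributed
% roughly like this, which is what would make a simplicity-biased learner generalise
% well from few samples.
```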