We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
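To make the flavour of that evidence concrete, here is a minimal sketch of the kind of sampling experiment one can run. The tiny ReLU architecture, the Gaussian weight sampling, and the zlib-based complexity proxy are all placeholder choices of mine, not the setup from the linked papers.

```python
# Rough sketch: estimate the parameter-space "volume" of boolean functions on
# 7 inputs by sampling random weights for a tiny ReLU network and counting how
# often each input-output function shows up, then compare that frequency to a
# crude compressibility-based complexity proxy.
import zlib
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 7, 16, 100_000

# All 2^7 = 128 boolean inputs, encoded in {-1, +1}.
X = np.array([[1.0 if (i >> b) & 1 else -1.0 for b in range(n_in)]
              for i in range(2 ** n_in)])

counts = Counter()
for _ in range(n_samples):
    W1 = rng.normal(size=(n_in, n_hidden))
    b1 = rng.normal(size=n_hidden)
    w2 = rng.normal(size=n_hidden)
    h = np.maximum(X @ W1 + b1, 0.0)        # hidden ReLU layer
    f = (h @ w2 > 0).astype(np.uint8)       # the induced function: 128 output bits
    counts[f.tobytes()] += 1

def complexity_proxy(bits: bytes) -> int:
    """Crude stand-in for description length: zlib-compressed size in bytes."""
    return len(zlib.compress(bits))

# Sample frequency ~ fraction of parameter-space mapping to that function.
# If the simplicity-bias claim holds, the most frequent functions should also
# be the most compressible ones.
for bits, c in counts.most_common(5):
    print(f"freq={c / n_samples:.4f}  complexity_proxy={complexity_proxy(bits)}")
```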
It’s true that we don’t have a rigorous mathematical proof for “why” simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim “backwards”: suppose a function has a large volume in parameter-space. This means that many of the parameters are, in a sense, redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent it. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.
There is a mathematical theorem, called the Levin bound, that roughly formalises this intuition. This bound says essentially that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P → F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (i.e., if an element f of F has low complexity, then an exponentially larger number of elements of P map to f via m).
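For reference, I believe the bound can be written roughly like this, for a finite discrete P under the uniform measure, with K denoting (prefix) Kolmogorov complexity and the constant absorbing the complexity of the map m:

$$P(f) \;=\; \frac{|\{\,p \in P : m(p) = f\,\}|}{|P|} \;\le\; 2^{-K(f) + O(1)},$$

so a high-complexity f can only ever receive an exponentially small share of parameter-space.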
The Levin bound doesn’t apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.
Also, check out this.
In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what’s going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I’m not sure if accounting for the floating-point implementation is informative or not in this case.
My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it’s probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that’s not a reasonable/useful level of abstraction).
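As a concrete illustration of that granularity, here is what float32 spacing looks like near a weight of size 1; this is just standard floating-point behaviour, nothing framework-specific.

```python
# float32 has roughly 7 decimal digits of precision: near 1.0, the gap
# between adjacent representable values is about 1.2e-7, so a sufficiently
# small weight update is rounded away entirely.
import numpy as np

w = np.float32(1.0)
print(np.finfo(np.float32).eps)     # ~1.1920929e-07, machine epsilon for float32
print(np.spacing(w))                # gap between 1.0 and the next float32 above it
print(w + np.float32(1e-8) == w)    # True: the update is lost to rounding
```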
Thanks, this is great! OK, I’m pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.
I’m not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can’t really imagine anyone disagreeing with that part (though I’m still quite glad to have actual analysis to back that up).
Pinging you to see what your current thoughts are! I think that if “SGD is basically equivalent to random search” then that has huge, huge implications.
I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.
Sounds like you might already know this, but it’s important to note that the Levin bound only implies that functions with large volumes will be simple—it doesn’t imply that simple functions will have large volume. It’s the latter claim that’s difficult to show, but also more important for understanding the power of neural nets, since it would imply that they are Solomonoff-induction-like, able to learn arbitrary functions given enough training data. In particular you’d want some sort of exponential lower bound on the volume of parameter-space occupied by each function of given complexity. The statement derived from the Levin bound is pretty trivial and would hold even if the network could only represent linear functions, constant functions, or some other quite limited class.
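To spell out the contrast loosely: the Levin-style statement is an upper bound on the volume of complex functions, whereas the stronger claim would need a matching lower bound for simple ones, something like

$$P(f) \;\le\; 2^{-K(f) + O(1)} \quad \text{(what the bound gives)} \qquad \text{vs.} \qquad P(f) \;\ge\; 2^{-aK(f) - b} \quad \text{(what you would want)},$$

where a and b are constants depending on the architecture and the parameter-function map (the exact form here is just illustrative). Only the second kind of statement would rule out degenerate cases like a network that can only represent linear or constant functions.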
Sorry I didn’t notice this earlier! What do you think about the argument that Joar gave?
If a function is small-volume, it’s complex, because it takes a lot of parameters to specify.
If a function is large-volume, it’s simple, because it can be compressed a lot since most parameters are redundant.
It sounds like you are saying: Some small-volume functions are actually simple, or at least this might be the case for all we know, because maybe it’s just really hard for neural networks to efficiently represent that function. This is especially clear when we think of simplicity in the minimum description length / Kolmogorov sense; the “+BusyBeaver(9)” function can be written in a few lines of code but would require a neural net larger than the universe to implement. Am I interpreting you correctly? Do you think there are other important senses of simplicity besides that one?
Yeah, exactly: the problem is that there are some small-volume functions which are actually simple. The argument for small-volume → complex doesn’t go through, since there could be other ways of specifying the function.
Other senses of simplicity include various circuit complexities and Levin complexity. There’s no argument that parameter-space volume corresponds to either of them, AFAIK. (You might think parameter-space volume would correspond to “neural net complexity”, the number of neurons in a minimal-size neural net needed to compute the function, but I don’t think this is true either: every parameter is Gaussian, so it’s unlikely for most of them to be zero.)
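For reference, Levin complexity trades description length off against running time; roughly,

$$Kt(x) \;=\; \min_{p} \bigl\{\, |p| + \log t \;:\; U(p) \text{ outputs } x \text{ within } t \text{ steps} \,\bigr\},$$

for a fixed universal machine U. Unlike plain Kolmogorov complexity, Kt is at least computable in principle (by brute-force search), though not tractably.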
Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.
The second Towards Data Science post references this paper, which is also the main reference through which Levin’s result is mentioned in the first paper you posted. So I assume reading these references should be enough to get the gist.
There is a proof of it in “An Introduction to Kolmogorov Complexity and Its Applications” by Ming Li & Paul Vitanyi.