To make sure I am following:
--The stuff linked in this post hypothesizes that simple functions enjoy bigger volume in parameter-space, i.e. there are more possible combinations of neuron weights that add up to a simple function than a complex one. Combined with the plausible hypothesis that real-world data tends to have simple functions that fit it, this explains why neural networks generalize well, and the explanation has nothing to do with SGD and everything to do with the prior, so to speak, implicit in the parameter space.
--However, the post doesn’t prove this hypothesis, or give a theoretical argument for it of the sort “Here’s why, given what we know about math and about complexity theory, simple functions should enjoy bigger volume in parameter-space.” Rather, the post (and the related papers) argues that this hypothesis is true because it’s an elegant hypothesis that explains the data well (and they have various experiments to back this up).
--Hence, you think there still remains the mystery of why this hypothesis is true—why simple functions enjoy bigger volume—and you think the answer probably has something to do with the architecture rather than SGD, but you link to some stuff about SGD that seems relevant.
I’m not at all sure I’m following, which is why I’m asking. :)
We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
It’s true that we don’t have a rigorous mathematical proof for “why” simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim “backwards”: suppose a function has a large volume in parameter-space. This means that many of the parameters are, in a sense, redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent that function. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.
There is a mathematical theorem, called the Levin bound, that roughly formalises this intuition. This bound essentially says that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P → F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (i.e., if an element f of F has low complexity, then an exponentially larger number of elements of P map to f via m).
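To make the “exponentially biased” claim concrete, here is a toy illustration. It is my own construction rather than anything from the linked papers; the architecture, the {-1, 0, +1} weight grid, and the input size are all just choices for the example. It enumerates every weight setting of a tiny threshold network, groups the settings by the Boolean function they compute, and prints how the “volume” is distributed across functions:

```python
# Toy illustration (not from the linked papers): enumerate a tiny discrete
# parameter space and count how many parameter settings map to each function.
import itertools
from collections import Counter

def net(x, w):
    # A 2-input, 2-hidden-unit, 1-output threshold network.
    # w holds 6 weights and 3 biases, each in {-1, 0, +1}.
    w1, w2, w3, w4, b1, b2, v1, v2, b3 = w
    h1 = 1 if w1 * x[0] + w2 * x[1] + b1 > 0 else 0
    h2 = 1 if w3 * x[0] + w4 * x[1] + b2 > 0 else 0
    return 1 if v1 * h1 + v2 * h2 + b3 > 0 else 0

inputs = list(itertools.product([0, 1], repeat=2))
volume = Counter()
for w in itertools.product([-1, 0, 1], repeat=9):
    f = tuple(net(x, w) for x in inputs)  # the Boolean function this w computes
    volume[f] += 1

total = 3 ** 9  # 19683 parameter settings in all
for f, v in volume.most_common():
    print(f, v, f"{v / total:.1%}")
# The distribution is very skewed: simple functions like the constants take up
# a large share of the 19683 settings, while functions like XOR (truth table
# (0, 1, 1, 0)) account for a far smaller share.
```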
The Levin bound doesn’t apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.
Also, check out this.
In what sense is the parameter space of a neural network not finite and discrete? It is often useful to understand floating-point values as continuous, but in fact they are discrete such that it seems like a theorem which assumes discreteness would still apply.
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what’s going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I’m not sure if accounting for the floating-point implementation is informative or not in this case.
My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it’s probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that’s not a reasonable/useful level of abstraction).
Thanks, this is great! OK, I’m pretty convinced now, but I was biased to believe something like this anyway. I would love to hear what a skeptic thinks.
I’m not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can’t really imagine anyone disagreeing with that part (though I’m still quite glad to have actual analysis to back that up).
Pinging you to see what your current thoughts are! I think that if “SGD is basically equivalent to random search” then that has huge, huge implications.
I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.
Sounds like you might already know this, but it’s important to note that the Levin bound only implies that functions with large volumes will be simple—it doesn’t imply that simple functions will have large volume. It’s the latter claim that’s difficult to show, but also more important for understanding the power of neural nets, since it would imply that they are Solomonoff-induction-like, able to learn arbitrary functions given enough training data. In particular you’d want some sort of exponential lower bound on the volume of parameter-space occupied by each function of given complexity. The statement derived from the Levin bound is pretty trivial and would hold even if the network could only represent linear functions, constant functions, or some other quite limited class.
Sorry I didn’t notice this earlier! What do you think about the argument that Joar gave?
If a function is small-volume, it’s complex, because it takes a lot of parameters to specify.
If a function is large-volume, it’s simple, because it can be compressed a lot since most parameters are redundant.
It sounds like you are saying: Some small-volume functions are actually simple, or at least this might be the case for all we know, because maybe it’s just really hard for neural networks to efficiently represent that function. This is especially clear when we think of simplicity in the minimum description length / Kolmogorov sense; the “+BusyBeaver(9)” function can be written in a few lines of code but would require a neural net larger than the universe to implement. Am I interpreting you correctly? Do you think there are other important senses of simplicity besides that one?
Yeah, exactly—the problem is that there are some small-volume functions which are actually simple. The argument for small-volume --> complex doesn’t go through since there could be other ways of specifying the function.
Other senses of simplicity include various circuit complexities and Levin complexity. There’s no argument that parameter-space volume corresponds to either of them AFAIK (you might think parameter-space volume would correspond to “neural net complexity”, the number of neurons in a minimal-size neural net needed to compute the function, but I don’t think this is true either—every parameter is Gaussian so it’s unlikely for most to be zero).
Do you have a good reference for the Levin bound? My attempts at finding a relevant paper all failed.
The second Towards Data Science post references this paper, which is also the main reference through which Levin’s result is mentioned in the first paper you posted. So I assume reading these references should be enough to get the gist.
There is a proof of it in “An Introduction to Kolmogorov Complexity and Its Applications” by Ming Li & Paul Vitanyi.
Hmmm… I don’t know if that’s how I would describe what’s happening. I would say:
--The above post provides empirical evidence that there isn’t much difference between the generalization performance of “doing SGD on DNNs until you get some level of performance” and “randomly sampling DNN weights until you get some level of performance” (a rough sketch of what such a comparison looks like is included after this list).
--I find that result difficult to reconcile with both theoretical and empirical arguments for why SGD should be different than random sampling, such as the experiments I ran and linked above.
--The answer to this question, one way or another, is important for understanding the DNN prior, where it’s coming from, and what sorts of things it’s likely to incentivize—e.g. is there a simplicity bias which might incentivize mesa-optimization or incentivize pseudo-alignment?
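For concreteness, here is the rough sketch promised in the first bullet. It is my own toy setup rather than the experiments from the post or the ones linked above; the dataset, architecture, accuracy target, and sampling cap are all illustrative choices:

```python
# Toy sketch (my own setup, not the post's experiments): compare
#   (a) gradient descent on a small net until it reaches some training accuracy,
#   (b) randomly sampling weights until the same training accuracy is reached,
# and then look at test accuracy under both procedures.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small, well-separated 2D toy dataset (two Gaussian blobs).
n = 200
X = torch.cat([torch.randn(n, 2) + 2.0, torch.randn(n, 2) - 2.0])
y = torch.cat([torch.zeros(n), torch.ones(n)])
perm = torch.randperm(2 * n)
X, y = X[perm], y[perm]
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

def make_net():
    return nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

def accuracy(net, X, y):
    with torch.no_grad():
        return ((net(X).squeeze(-1) > 0).float() == y).float().mean().item()

TARGET = 0.95  # training accuracy both procedures must reach

# (a) Gradient descent (full-batch, for simplicity) until the target is hit.
sgd_net = make_net()
opt = torch.optim.SGD(sgd_net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(10_000):
    if accuracy(sgd_net, X_train, y_train) >= TARGET:
        break
    opt.zero_grad()
    loss_fn(sgd_net(X_train).squeeze(-1), y_train).backward()
    opt.step()

# (b) Random weight sampling until the same target is hit (capped, since this
# can take a while; it only works at all because the problem is tiny).
sample_net = make_net()
for _ in range(200_000):
    for p in sample_net.parameters():
        nn.init.normal_(p, std=1.0)
    if accuracy(sample_net, X_train, y_train) >= TARGET:
        break

print("SGD test accuracy:     ", accuracy(sgd_net, X_test, y_test))
print("Sampling test accuracy:", accuracy(sample_net, X_test, y_test))
# The question in the thread is whether, at matched training performance, these
# two numbers tend to be similar (sampling roughly equals SGD) or systematically
# different.
```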
OK, thanks! The first bullet point is I think a summary of my first two bullet points, but with different emphasis. I’ll check out the experiments you linked.
I’m curious to know what you mean by “focus on the architecture instead.” My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.
I think it’s clearly true that neural net architecture is a huge contributor to inductive biases and comes with a strong simplicity bias. What’s surprising to me is that I would expect a similar thing to be true of SGD, and yet these results seem to indicate that SGD vs. random search has only a pretty minimal effect on inductive biases.
I’m having trouble parsing your first sentence—are you saying that yes, pretty much any neural net architecture will have a simplicity bias, but also that the biases will be importantly different depending on which specific architecture you pick?
I think I would have predicted that SGD vs. random search would have a pretty minimal effect on inductive biases. My pet theory for why neural nets have a bias towards simplicity is that there are more ways for neural nets to encode simple functions than complex functions, i.e. larger regions of parameter space for simple functions. As the OP argues (I think), if this is right, then it makes sense that SGD and random search don’t affect the bias that much, since larger regions of parameter space will also have larger basins of attraction for SGD to roll down. (As for the justification of my pet theory, well, this is really sketchy, but see my top-level comment below.)
Important point here: the relevant notion of “simple” is not “low Kolmogorov complexity”; it’s more like smoothness. A bias toward “simple” functions, for this notion of “simple”, would mainly make interpolation work well, not necessarily other prediction problems.
I’m not sure I agree with this—I think Kolmogorov complexity is a relevant notion of complexity in this context.
How so? In particular, are you saying that Gaussian processes bias towards low Kolmogorov complexity, or are you saying that there’s some simplicity prior here besides what Gaussian processes have? If the former, how does a Gaussian process bias towards short programs other than the compressibility offered by smoothness?
(Of course smoothness itself implies some compressibility in the Kolmogorov picture; the interesting question is whether there’s a bias towards functions which can be compressed in more general ways.)
I’d be interested to hear your take on this comment above, which is talking about compressibility, which to me sounds like Kolmogorov complexity.
The things Joar is saying there generally match my intuitions, although there are some things unsaid where my intuitions might diverge more.
There’s about a gazillion measures of “simplicity”. If we pick some simplicity measure M, the usual rule is that things with low M-complexity have low Kolmogorov complexity, but things with low Kolmogorov complexity don’t necessarily have low M-complexity. For instance, if a file can be compressed a lot by a ZIP file encoding, then that file has low Kolmogorov complexity, but not all low Kolmogorov complexity strings can be compressed by ZIP specifically.
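To make the ZIP example concrete, here is a small illustration (mine, not the commenter’s): a pseudorandom byte string generated from a fixed seed has low Kolmogorov complexity, because the short generating program describes it exactly, yet a general-purpose compressor like zlib treats it as essentially incompressible, whereas a merely repetitive string compresses to almost nothing.

```python
# Toy illustration: low Kolmogorov complexity does not imply compressibility
# under a specific scheme like DEFLATE/zlib.
import random
import zlib

# Highly repetitive string: low Kolmogorov complexity AND very zlib-compressible.
repetitive = b"abc" * 10000

# Pseudorandom string from a fixed seed: still low Kolmogorov complexity
# (this short program generates it exactly), but it looks incompressible to zlib.
rng = random.Random(0)
pseudorandom = bytes(rng.randrange(256) for _ in range(30000))

for name, data in [("repetitive", repetitive), ("pseudorandom", pseudorandom)]:
    print(name, len(data), "->", len(zlib.compress(data, 9)))
# Expect the repetitive string to shrink dramatically and the pseudorandom one
# to stay roughly the same size (or even grow slightly).
```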
In this context, smoothness is one such relevant measure: smooth functions have low Kolmogorov complexity, but there are other ways to have low Kolmogorov complexity without being smooth. I don’t know about the Levin bound specifically, but in math these sorts of theorems are usually about smoothness. In complexity theory in particular, theorems often connect smoothness measures to circuit size measures (which are another class of complexity measures which imply low Kolmogorov complexity but not vice versa).
Roughly speaking, if a complexity measure were as general as Kolmogorov complexity, then we’d expect it to be uncomputable. If we can actually find low-complexity functions under a complexity measure M, then that complexity measure is probably less general than Kolmogorov complexity. From there, the natural next question is exactly what generality is kept/lost, and whether the loss of generality actually matters for things in the real world.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it’s very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output are represented in binary form by a bit string). If you try to learn this function by training on only odd numbers, then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only numbers in a certain range, then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and so on.
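This is easy to turn into a quick experiment. Here is a minimal sketch of the kind of setup described above; the width, depth, bit encoding, and training details are my own choices rather than anything from the discussion. It trains a small fully connected net to copy an 8-bit input to its output using only odd numbers, then checks the even numbers:

```python
# Minimal sketch of the experiment described above (architecture, encoding and
# training details are illustrative choices): learn the identity on 8-bit
# strings from odd numbers only, then test on even numbers.
import torch
import torch.nn as nn

torch.manual_seed(0)
BITS = 8

def to_bits(ns):
    return torch.tensor([[(n >> i) & 1 for i in range(BITS)] for n in ns],
                        dtype=torch.float32)

odd = to_bits(range(1, 256, 2))    # training set: the odd numbers 1..255
even = to_bits(range(0, 256, 2))   # test set: the even numbers 0..254

net = nn.Sequential(nn.Linear(BITS, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, BITS))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(3000):
    opt.zero_grad()
    loss_fn(net(odd), odd).backward()  # target = input (the identity function)
    opt.step()

def bitwise_accuracy(x):
    with torch.no_grad():
        return ((net(x) > 0).float() == x).float().mean().item()

print("train (odd) bit accuracy:", bitwise_accuracy(odd))
print("test (even) bit accuracy:", bitwise_accuracy(even))
# The expectation from the comment above is that the net fits the odd numbers
# but does not robustly recover the identity on the even numbers; in this
# encoding the lowest-order bit is always 1 during training, so the net has no
# reason to copy that bit rather than just output 1.
```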
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot “see” properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to “see” more kinds of functions. For example, NALUs give neural networks the ability to “see” arithmetic relations more easily. I hence certainly think it’s a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It’s a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it’s essentially just an intuition pump.
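Here is a small illustration of that construction. It is my own toy, reusing the same kind of tiny discrete threshold network as in the earlier sketch: compute the distribution over functions induced by a uniform distribution over the parameters, build a Huffman code for it, and check that high-volume functions get short codewords, so that “codeword plus a description of the parameter-function map” is a short description of any high-volume function.

```python
# Toy illustration of the Huffman-code construction: codeword lengths under the
# induced distribution bound how complex a high-volume function can be.
import heapq
import itertools
import math
from collections import Counter

def net(x, w):
    # Same tiny 2-2-1 threshold network with weights in {-1, 0, +1} as before.
    w1, w2, w3, w4, b1, b2, v1, v2, b3 = w
    h1 = 1 if w1 * x[0] + w2 * x[1] + b1 > 0 else 0
    h2 = 1 if w3 * x[0] + w4 * x[1] + b2 > 0 else 0
    return 1 if v1 * h1 + v2 * h2 + b3 > 0 else 0

inputs = list(itertools.product([0, 1], repeat=2))
volume = Counter(tuple(net(x, w) for x in inputs)
                 for w in itertools.product([-1, 0, 1], repeat=9))
total = sum(volume.values())

# Build a Huffman code for the distribution over functions induced by a
# uniform distribution over the parameters.
heap = [(v, i, {f: ""}) for i, (f, v) in enumerate(volume.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    wa, _, ca = heapq.heappop(heap)
    wb, _, cb = heapq.heappop(heap)
    merged = {f: "0" + code for f, code in ca.items()}
    merged.update({f: "1" + code for f, code in cb.items()})
    heapq.heappush(heap, (wa + wb, tiebreak, merged))
    tiebreak += 1
codes = heap[0][2]

for f, v in volume.most_common(3) + volume.most_common()[-3:]:
    print(f, "volume:", v,
          "codeword bits:", len(codes[f]),
          "-log2 P(f):", round(-math.log2(v / total), 2))
# High-volume functions get short codewords (length roughly -log2 P(f)), so
# "codeword + description of the parameter-function map" is a short program
# for any high-volume function. That is the shape of the Levin-bound argument.
```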
That’s a clever example, I like it.
Based on that description, it should be straightforward to generalize the Levin bound to neural networks. The main step would be to replace the Huffman code with a turbocode (or any other near-Shannon-bound code), at which point the compressibility is basically identical to the log probability density, and we can take the limit to continuous function space without any trouble. The main change is that entropy would become relative entropy (as is normal when taking info theory bounds to a continuous limit). Intuitively, it’s just using the usual translation between probability theory and minimum description length, and applying it to the probability density of parameter space.
Thanks, that helps. Perhaps an example would be: A purely feed-forward neural network might be “blind” to algorithms that are Kolmogorov-simple but which involve repeatedly performing the same procedure a bunch of times (even if it is technically big enough to contain such an algorithm). So the simplicity bias of said network would be importantly different from Kolmogorov complexity.
That’s exactly right. That exact example would be a case of high circuit size but low Kolmogorov complexity.