Like Rohin, I’m not impressed with the information theoretic side of this work.
Specifically, I’m wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don’t think this captures the kinds of function complexity we care about for NNs.
Consider:
If A, B are finite sets, then there are a finite number of functions f:A→B. Let’s write S for the finite set of such functions.
The authors view the counting measure on S, where every function is equally likely, as “unbiased.”
This choice makes sense if A, B are truly unstructured collections of objects with no intrinsic meaning.
However, if there is some extra structure on them like a metric, it’s no longer clear that “all functions are equally likely” is the right reference point.
Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of S relative to the counting measure (the small enumeration below makes this concrete).
Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (e.g. continuous ones) will look like a simplicity prior.
This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.
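To see how small that subset can be, here is a toy enumeration (my own illustration, not from the paper): take A = {0, ..., 7} with its usual metric and B = {0, 1}, and count the functions whose label flips at most once as you walk along A in order, which is about the mildest possible way of respecting the metric.

```python
from itertools import product

A = range(8)                                      # a small input space with its usual order/metric
all_fns = list(product([0, 1], repeat=len(A)))    # all 2^8 = 256 functions A -> {0, 1}

def label_changes(f):
    """How many times the label flips as we walk along A in metric order."""
    return sum(f[i] != f[i + 1] for i in range(len(f) - 1))

# Mildly structure-respecting functions: the label flips at most once (threshold functions).
threshold_fns = [f for f in all_fns if label_changes(f) <= 1]

print(len(threshold_fns), "of", len(all_fns))     # 16 of 256
```

Already only about 6% of S in this tiny example, and the fraction shrinks exponentially as A grows; any prior that favors such functions is wildly non-uniform by counting-measure standards.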
Let’s see how this goes wrong in the Shannon entropy argument from this paper.
The authors consider (a quantity equivalent to) the fraction of inputs for which a given function outputs 1.
They consider a function simpler if this fraction is close to 1 or 0, because then it’s easier to compress.
With the counting measure, “most” functions output 1 about half of the time. (Like the binomial distribution—there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
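For concreteness (my own numbers, not taken from the paper), here is the counting-measure picture for 7-bit inputs, where the number of inputs a function labels 1 is Binomial(2^7, 1/2):

```python
from math import comb

m = 2 ** 7                      # number of inputs for 7-bit vectors
total = 2 ** m                  # number of boolean functions on them

def p(k):
    """Counting-measure probability that a function labels exactly k inputs as 1."""
    return comb(m, k) / total

print(sum(p(k) for k in range(54, 75)))   # ~0.94: most functions label roughly half the inputs 1
print(p(0) + p(m))                        # ~6e-39: the two maximally compressible constant functions
```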
To learn binary functions with an NN, they encode the inputs as binary vectors like (1,0,1). They study what happens when you feed these to either (a) a linear model or (b) a stack of ReLU layers, with random weights.
It turns out that the functions expressed by these models are much more likely than the counting measure would suggest to assign a single label (1 or 0) to most inputs.
Why?
For a random function on an input space of size 2^n, you need to roll 2^n independent random variables. Each roll affects only one input element.
But when you encode the inputs as vectors of length n and feed them into a model, the layers of the model have weights that are also n-vectors. Each of their components affects many input elements at once, in the same direction. This makes it likely for the judgments to clump towards 1 or 0.
For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to 1. This happens a fraction 1/2^n of the time. But only one boolean function maps every input to 1, so the counting measure would give this probability 1/2^(2^n).
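A quick simulation makes the clumping visible (my own sketch, with the arbitrary convention that the all-zeros input gets label 0, not the paper's exact setup): draw Gaussian weight vectors for the thresholdless linear model on 7-bit inputs and look at the fraction of inputs labeled 1.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n = 7
X = np.array(list(product([0, 1], repeat=n)))     # all 2^7 = 128 binary inputs

fractions = []
for _ in range(20_000):
    w = rng.standard_normal(n)                    # random linear model, no threshold
    labels = X @ w > 0                            # the all-zeros input always gets label 0 here
    fractions.append(labels.mean())
fractions = np.array(fractions)

# Under the counting measure the fraction of 1s concentrates tightly around 0.5
# (std ~0.04 for 128 inputs). The random linear model spreads out toward 0 and 1:
print(np.mean(np.abs(fractions - 0.5) > 0.25))    # a substantial share of heavily clumped functions
print(np.mean(fractions >= 127 / 128))            # ~1/2^7 ~ 0.008: the all-positive-weight case above
```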
This doesn’t seem like a special property of neural nets. It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that “use” the structure in their labeling rule. “Using” the structure means any decision you make about how to treat one input element has implications for others (because they’re close to it, or point in the same direction, or something). Thus you have fewer independent decisions to make, and there’s a higher probability they all push in the same direction.
Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity. Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).
“Simple” functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way. What does it mean for a function to respond to binary vectors in a predictable way? It means that knowing the values of some of the bits provides information about the output, even if you don’t know all of them. But since our models encode the inputs as binary vectors, we are already setting them up to have properties like this.
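A rough way to see this (using zlib's LZ77-style compressor as a stand-in for the paper's LZ measure, which it is not exactly): write a function's outputs as a 2^n-character string, reading the inputs off in the binary enumeration. A function that just copies one input bit yields a periodic string that compresses to almost nothing, while a random function's string is essentially incompressible; the “simplicity” lives in that enumeration.

```python
import random
import zlib
from itertools import product

random.seed(0)
n = 14
inputs = list(product([0, 1], repeat=n))          # the binary enumeration of the input space

def lz_proxy(f):
    """Length of the zlib-compressed output string: a crude LZ-complexity proxy."""
    s = "".join(str(f(x)) for x in inputs)
    return len(zlib.compress(s.encode(), 9))

print(lz_proxy(lambda x: x[-1]))                  # copies the lowest-order bit: a few dozen bytes
print(lz_proxy(lambda x: random.getrandbits(1)))  # a uniformly random truth table: about 2 KB
```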
Like Rohin, I’m not impressed with the information theoretic side of this work.
Specifically, I’m wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don’t think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.
Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the “ideal” complexity measure would be in the relevant case. In particular, if P(f) ≃ 2^(−K(f)+C) regardless of whether K represents Boolean complexity, or LZ complexity, etc., then this is also likely to be true for the “ideal” complexity measure for neural networks.
Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!
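Concretely, the kind of estimate we have in mind looks roughly like this (a stripped-down sketch with toy choices of architecture, input size and sample count, not our actual experimental code, and with zlib as a crude complexity proxy): sample random weights, record the induced boolean function, estimate P(f) from the frequencies, and compare against the proxy.

```python
import zlib
from collections import Counter
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n, hidden, samples = 7, 32, 100_000
X = np.array(list(product([0, 1], repeat=n)), dtype=float)    # all 2^7 binary inputs

counts = Counter()
for _ in range(samples):
    W1 = rng.standard_normal((n, hidden))                     # random one-hidden-layer ReLU net
    W2 = rng.standard_normal(hidden)
    f = (np.maximum(X @ W1, 0.0) @ W2 > 0).astype(np.uint8)   # the induced boolean function
    counts[f.tobytes()] += 1

# Empirical P(f) for the most frequently sampled functions, next to the complexity proxy;
# the most frequently sampled functions tend to be highly compressible.
for table, c in counts.most_common(5):
    bits = "".join(map(str, np.frombuffer(table, dtype=np.uint8)))
    print(c / samples, len(zlib.compress(bits.encode(), 9)))
```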
As for the rest of your comment—what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?
Most complexity measures give roughly similar values for the (relative) complexity of most objects
I’ll write mostly about this statement, as I think it’s the crux of our disagreement.
The statement may be true as long as we hold the meaning of “objects” constant as we vary the complexity measure.
However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can’t simply say the complexity measures for space A on the original A-objects inevitably agree with those for space B on the translated B-objects. Whether this is true depends on our choice of translation.
(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object. Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects. But the example shows that the claim about complexity measures will only hold if our translation is “good enough” in some sense. If we don’t have any idea what “good enough” means, something is missing from the story.)
In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure. (More precisely, the hand-waviness about what metric structure survives the translation, if any.) Dropping the metric destroys the information needed by complexity measures that care about how easy it is to reconstruct an object “close to” the specified one.
Basic information theory doesn’t require a metric, only a measure. There’s no sense of “getting an output approximately right,” only of “getting the exactly right output with high probability.” If you care about being approximately right according to some metric, this leads you to rate-distortion theory.
Both of these domains—information theory without a metric, and with one—define notions of incompressibility/complexity, but they’re different. Consider two distributions on R:
(1) The standard normal.
(2) The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R.
According to basic information theory, these are equally simple/compressible. (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)
But in rate-distortion theory, (1) is way more simple/compressible than (2). If you’re coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1]. Whereas if you’re coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.
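Here's a toy numerical version of that contrast (my own illustration, using a crude fixed-budget quantizer rather than an actual rate-distortion calculation): a rigid piecewise translation of the samples leaves the differential entropy essentially untouched, but a fixed bit budget for describing a sample to within a given accuracy goes much further on the intact normal than on the scattered one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)                 # distribution (1)

# Distribution (2), in miniature: rigidly translate one slice of the normal far away.
# A piecewise rigid translation just rearranges the density, so the differential
# entropy is (essentially) unchanged; the metric-free notion of complexity can't
# tell (1) and (2) apart.
y = np.where((x >= 0) & (x < 0.1), x + 1e8, x)

# A crude metric-aware comparison (NOT the rate-distortion function): worst-case
# error of a 12-bit uniform quantizer spanning each sample set.
def quantizer_error(samples, bits=12):
    span = samples.max() - samples.min()
    return span / 2 ** (bits + 1)

print(quantizer_error(x))   # ~1e-3: 12 bits pin down a standard normal quite well
print(quantizer_error(y))   # ~1e4: the same 12 bits are hopeless once pieces are scattered
```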
If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities. When you remove the metric, they’re identical. So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.