Suppose it’s the early twentieth century, and we’re trying to quantify the concept of “information”. Specifically, we want to measure “how much information” one variable contains about another—for instance, how much information a noisy measurement of the temperature of an engine contains about the actual engine temperature.
Along comes Karl Pearson, and suggests using his “correlation coefficient” (specifically the square of the correlation coefficient, $\rho(X,Y)^2$). As a measure of information, this has some sensible properties:
If there’s no information, then $\rho(X,Y)^2$ is zero.
If $\rho(X,Y)^2$ is one, then there’s perfect information: one variable tells us everything there is to know about the other.
It’s symmetric: the amount of information which X tells us about Y equals the amount of information which Y tells us about X.
As an added bonus, it’s mathematically simple to calculate, estimate, and manipulate. Sure, it’s not very “principled”, but it seems like a good-enough measure to work with.
Now an engineer from Bell Telephone shows up with a real-world problem: they’ve been contracted to create secure communications for the military. They want to ensure that externally-visible data Y contains no information about secret message X, so they need a way to measure “how much information” one variable contains about another. What a perfect use-case! We advise them to design their system so that X and Y have zero correlation.
A few years later, Bell Telephone gets a visit from a very unhappy colonel. Apparently the enemy has been reading their messages. Zero correlation was not enough to keep the secret messages secret.
Now, Bell could patch over this problem. For instance, they could pick a bunch of functions like $X^2$, $\sin(Y)$, $e^X + 2X - 1$, etc., and require that those also be uncorrelated. With enough functions, and a wide enough variety, that might be enough… but it’s going to get very complicated very quickly, with all these new design constraints piling up.
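To see concretely how zero correlation can still leak information, here is a minimal sketch in Python. The toy system is an assumption made up for illustration (a secret X uniform on {-1, 0, 1} and visible data Y = X^2); it is not the actual Bell design, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import log2

# Hypothetical toy system: secret X is uniform on {-1, 0, 1}, visible Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(joint, f):
    """Expected value of f(x, y) under the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

def covariance(joint):
    """cov(X, Y) = E[XY] - E[X]E[Y], computed exactly with fractions."""
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    return expect(joint, lambda x, y: x * y) - ex * ey

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(float(p) * log2(float(p / (px[x] * py[y])))
               for (x, y), p in joint.items() if p > 0)

cov_xy = covariance(joint)
mi_xy = mutual_information(joint)
print("cov(X, Y) =", cov_xy)
print("I(X; Y) in bits =", round(mi_xy, 4))
```

The covariance comes out to exactly zero, so the correlation is zero too, yet the mutual information is about 0.92 bits: anyone who sees Y learns whether X was zero. Requiring that X^2 and Y also be uncorrelated would catch this particular leak, which is the kind of patch described above.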
Fortunately, off in a corner of Bell Labs, one of their researchers already has an alternative solution. Claude Shannon suggests quantifying “how much information” X contains about Y using his “mutual information” metric I(X;Y). This has a bunch of sensible properties, but the main argument is that I(X;Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits “savable” by knowing Y. By imagining different things as the “message” and thinking about how hard it is to guess X after knowing Y, we can intuitively predict that this metric will apply to lots of different situations, including Bell’s secret message problem.
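Written out in standard textbook notation for discrete variables, the “bits savable” description reads roughly as follows. The symbols H and p are not used elsewhere in this post, so treat them as assumptions of this sketch: p is the probability distribution of the variables, and H is Shannon's entropy, which approximates the average number of bits needed to send a value.

```latex
\begin{align*}
H(X) &= -\sum_x p(x)\,\log_2 p(x)
  && \text{(approx. average bits to send } X\text{)} \\
H(X \mid Y) &= -\sum_{x,y} p(x,y)\,\log_2 p(x \mid y)
  && \text{(approx. average bits to send } X \text{ when the receiver knows } Y\text{)} \\
I(X;Y) &= H(X) - H(X \mid Y)
  = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
\end{align*}
```

The last expression is zero exactly when p(x,y) = p(x) p(y) for every pair of values, i.e. when X and Y are independent.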
Shannon advises the engineers to design their system so that X and Y have zero mutual information. And now, the enemy can’t read their messages quite so easily.
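As an illustration of what “zero mutual information” can look like in practice, here is a minimal sketch of a one-bit one-time pad. This is a standard textbook construction swapped in for illustration, not something the story says Bell used; the message distribution and the helper names are assumptions.

```python
from fractions import Fraction
from itertools import product
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(float(p) * log2(float(p / (px[x] * py[y])))
               for (x, y), p in joint.items() if p > 0)

# Assumed distributions: a biased one-bit secret X and a uniform one-bit key K.
p_x = {0: Fraction(3, 4), 1: Fraction(1, 4)}
p_k = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def joint_with_key(p_x, p_k):
    """Joint distribution of (X, Y) where Y = X XOR K and K is independent of X."""
    joint = {}
    for (x, px), (k, pk) in product(p_x.items(), p_k.items()):
        y = x ^ k
        joint[(x, y)] = joint.get((x, y), 0) + px * pk
    return joint

padded = joint_with_key(p_x, p_k)               # visible data is X XOR K
unpadded = {(x, x): p for x, p in p_x.items()}  # visible data is X itself

mi_padded = mutual_information(padded)
mi_unpadded = mutual_information(unpadded)
print("I(X; Y) with key    =", mi_padded)
print("I(X; Y) without key =", round(mi_unpadded, 4))
```

With the uniform key, Y is independent of X and the mutual information is zero; without the key, Y reveals everything about X and the mutual information equals the entropy of X (about 0.81 bits for this assumed distribution).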
Proxies vs Definitions
In this story, what does the correlation coefficient do “wrong” which mutual information does “right”? What’s the generalizable lesson here?
The immediate difference is that correlation is a proxy for amount of information, while mutual information is a true definition/metric. When we apply optimization pressure to a proxy, it breaks down; that’s Goodhart’s Law. In this case, the optimization pressure is a literal adversary trying to read our secret messages. The optimizer finds the corner cases where our proxy no longer perfectly captures our intuitive idea of “no information”, and they’re able to extract information about our secret messages. Correlation doesn’t capture our intuitive notion of “information which X contains about Y” well enough for zero correlation to prevent our adversaries from reading our messages.
Mutual information, by contrast, handles the optimization pressure just fine. We intuitively expect that “Y contains zero information about X” is enough to keep our messages secret, even in the presence of adversaries, and the mutual information definition of “information” is indeed enough to match that intuitive expectation.
So… that’s all well and good. We want definitions/metrics which are robust to optimization pressure, rather than proxies which break down. But how do we find robust definitions/metrics in the first place? In the long run, of course, we can try out a metric on lots of different problems, prove lots of different theorems about it, and get an idea of robustness that way. But there are infinitely many possible metrics for any given concept; we don’t have time to go through that whole process for all of them. How do we figure out in advance what the robust concept definitions are?
You Already Know The Answer
A classic quote from famed physicist John Archibald Wheeler: “Never make a calculation until you know the answer”.
In math, it’s very easy to write down some expressions or equations or definitions, and start pushing symbols around, without having any idea what the answer looks like or how to get there. In undergrad math classes, this often works, because the problem is set up so that there’s only a handful of things which you can do at all. In research, we don’t have that guardrail, and we especially don’t have that guardrail when finding the right definitions is part of the problem. I have literally spent months pushing symbols around without getting anywhere at all. Math is a high-dimensional space; brute force search does not work.
Bottom line: if we want to get anywhere, we need to already have at least some intuition for what we’re looking for, and we need that intuition to guide the search. “Never make a calculation until you know the answer” is the sort of lesson which gets beaten in by months or years of failure to follow it.
Fortunately, we already have a lot of intuition to lean on, even without years of mathematical study. For instance, if we look back at the information example from earlier… what are the intuitive arguments for why correlation seems like a reasonable measure of information?
If there’s no information, then $\rho(X,Y)^2$ is zero.
If $\rho(X,Y)^2$ is one, then there’s perfect information: one variable tells us everything there is to know about the other.
These seemed pretty natural, right? This is exactly what “knowing the answer” looks like—we have some intuition about what properties a measure of “information” should have. In the case of mutual information, the intuition was this argument:
I(X; Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits “savable” by knowing Y. By imagining different things as the “message” and thinking about how hard it is to guess X after knowing Y, we can intuitively guess that this metric will apply to lots of different situations…
These are the kinds of intuitions which guide our search in the high-dimensional space of mathematical definitions/metrics.
Note that the engineers’ idea that “data Y contains no information about secret message X” should be sufficient to prevent adversaries from reading the messages is also an intuitive property of information. Assuming our intuitions about information are correct (or at least approximately correct), a definition which fully captures our intuitive idea of information should imply this property. If it doesn’t, then either (a) our definition does not fully capture our intuitive idea of information, or (b) our intuition is wrong (in which case we should be able to translate the math back into an intuitive example of how our previous intuition failed).
… But Have You Fully Specified The Answer?
So, math is high-dimensional, and we need intuitions to guide our search. But both the correlation coefficient and mutual information have some intuitive arguments for why they’re good measures of information. What’s the difference? What makes one better than the other?
Let’s go back to the two intuitive arguments for the correlation coefficient:
If there’s no information, then $\rho(X,Y)^2$ is zero.
If $\rho(X,Y)^2$ is one, then there’s perfect information: one variable tells us everything there is to know about the other.
Key thing to notice: $\rho(X,Y)^2$ is not the only metric which satisfies these two criteria. For instance, we could exponentiate X and Y and then take the correlation, $\rho(e^X, e^Y)^2$, and both properties still apply. Same with $\rho(X^2, \sin(Y) + 2Y - 3)^2$. There are lots of degrees of freedom here; these two intuitive arguments are not enough to uniquely specify the correlation coefficient as our definition/metric.
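A quick numerical check shows that these alternative metrics really are different metrics, not just relabelings of the same one. This is a minimal sketch on an assumed toy distribution (X uniform on {-1, 0, 1}, Y = X^2); the function name is hypothetical.

```python
from math import exp

def corr_squared(pairs):
    """Squared Pearson correlation for a list of equally likely (u, v) pairs."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in pairs) / n
    var_u = sum((u - mean_u) ** 2 for u, _ in pairs) / n
    var_v = sum((v - mean_v) ** 2 for _, v in pairs) / n
    return cov ** 2 / (var_u * var_v)

# Assumed toy system: X uniform on {-1, 0, 1}, Y = X^2.
pairs = [(x, x * x) for x in (-1, 0, 1)]

rho2_plain = corr_squared(pairs)                                # rho(X, Y)^2
rho2_exp = corr_squared([(exp(x), exp(y)) for x, y in pairs])   # rho(e^X, e^Y)^2

print("rho(X, Y)^2     =", rho2_plain)
print("rho(e^X, e^Y)^2 =", rho2_exp)
```

On this system the first metric reports zero while the second reports a small but strictly positive value. Both satisfy the two intuitive criteria, yet they disagree about whether this system leaks anything, so the criteria alone cannot tell us which one to trust.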
By contrast, consider Shannon’s argument:
I(X; Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y.
This has zero degrees of freedom. This argument (with a couple approximations) is enough to uniquely specify Shannon’s formula for mutual information.
Adam Shimi gave a great analogy for this: the intuitive arguments are like a set of equations, and the definition/metric is like a solution. Ideally, we want the “equations” to nail down one unique “solution”. If that’s the case, then there’s only one definition compatible with our intuitive arguments. If we intuitively expect some additional properties to hold (e.g. “no information” being sufficient to prevent adversaries from reading our secret messages), then either they have to hold for that one definition, or our intuition is wrong.
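A toy illustration of the analogy, with literal equations (an example added here for concreteness, not one taken from the original conversation):

```latex
\begin{align*}
&\text{One constraint: } x + y = 2
  &&\Rightarrow\ \text{infinitely many solutions } (x, y) = (t,\ 2 - t) \\
&\text{Two constraints: } x + y = 2,\ \ x - y = 0
  &&\Rightarrow\ \text{exactly one solution } (x, y) = (1, 1)
\end{align*}
```

With only the first constraint, picking (0, 2) and calling it “the answer” is arbitrary. With both constraints, anything else we believe about the solution either holds for (1, 1) or is simply false.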
On the other hand, if our “equations” have multiple “solutions”, then it’s kind of misleading to pick out one solution and declare that to be our answer. Why that solution? If there are lots of different definitions/metrics which satisfy the intuitive arguments for correlation, then why not use one of the others? More to the point: how do we know our intuition itself isn’t built around some other metric which satisfies the properties? We believe our intuitive concept satisfies the listed properties, and we believe our intuitive concept satisfies some more general properties as well (e.g. “no information” protecting secret messages), but that does not mean that any random definition compatible with the listed properties is sufficient to imply the more general properties. If we want our intuition to apply, then we need to find the definition/metric which actually corresponds to our intuitive concept (assuming such a definition/metric exists), not just some proxy which satisfies a few of the same properties.
Recap
We want mathematical definitions/metrics which are robust; in particular, they should not break down when we apply optimization pressure. In the long run, we can verify robustness by using a definition/metric in lots of different problems and proving theorems about it. But math-space is high-dimensional, so we need a more efficient way to search for good definitions/metrics.
One main way we do this is to lean on intuitions. We already have intuitive concepts, and we have some beliefs about the properties those concepts should have. If we can accurately translate our intuitive concepts into mathematical definitions/metrics, then they should satisfy the intuitively-expected properties. (Or else our intuitions are wrong, and a good definition/metric should convince us of that when the definition doesn’t satisfy an expected property.)
The key challenge here is to come up with a set of intuitive arguments which uniquely specify a particular definition/metric, exactly like a set of equations can uniquely specify a solution. If our arguments have “many solutions”, then there’s little reason to expect that the ad-hoc “solution” we chose actually corresponds to our intuitive concept. If our chosen definition/metric does not correspond to our intuitive concept, then even if our intuition is correct, it shouldn’t be too surprising if the definition/metric fails to have more general properties which we intuitively expect.
In short: if our arguments are not sufficient to uniquely nail down one definition/metric, then we lose our main reason to expect the definition/metric to be robust.
Thank you to Adam Shimi for a conversation which led to this post.
What’s So Bad About Ad-Hoc Mathematical Definitions?