The VARIANCE of a random variable seems like one of those ad hoc metrics. I would be very happy for someone to come along and explain why I’m wrong on this. If you want to measure, as Wikipedia says, “how far a set of numbers is spread out from their average value,” why use E[(X-mean)^2] instead of E[|X-mean|], or more generally E[|X-mean|^p]? The best answer I know of is that E[(X-mean)^2] is easier to calculate than those other ones.
Variance has more motivation than just that it’s a measure of how spread out the distribution is. Variance has the property that if two random variables are independent, then the variance of their sum is the sum of their variances. By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness). Since it is the variance of the distributions you were adding together that determines this, variance is exactly the thing you care about if you want to know the degree of spreadout-ness of a sum of a large number of independent variables from the distribution. If you take any measure of how spread out a distribution is that doesn’t carry the same information as the variance, then it will fail to predict how spread out the sum of a large number of independent copies of the distribution is, by any measure.
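For reference, here is the standard one-line check of that additivity property, assuming both variances are finite: for independent X and Y with means μ_X and μ_Y,

$$\operatorname{Var}(X+Y) = E\big[\big((X-\mu_X)+(Y-\mu_Y)\big)^2\big] = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,E\big[(X-\mu_X)(Y-\mu_Y)\big],$$

and the cross term vanishes because independence lets the expectation factor into E[X-\mu_X]\,E[Y-\mu_Y] = 0. Note that E[|X-mean|] does not add up this way for independent variables, which is part of what makes variance the natural quantity here.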
Edit: On the subject of other possible measures of features of probability distributions, one could also make the same complaint about mean as a measure of the middle of a distribution, when there are possible alternatives like median. Again, a similar sort of argument can be used to identify mean as the best one in some circumstances. But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way), so this measures the minimal average distance the points of the distribution have to travel in order to all meet at one point (the median being the most efficient point for them to meet).
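A quick way to see that minimizing property, sketched for a continuous distribution with density f and CDF F (the discrete case works the same way, just with a subgradient in place of the derivative):

$$\frac{d}{dm}\,E\big[|X-m|\big] = \frac{d}{dm}\left(\int_{-\infty}^{m}(m-x)f(x)\,dx + \int_{m}^{\infty}(x-m)f(x)\,dx\right) = F(m) - \big(1-F(m)\big) = 2F(m) - 1,$$

which is zero exactly where F(m) = 1/2, i.e. at the median.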
Good point about the central limit theorem. Two nitpicks, though.
By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness)
The “or any other measure of spreadout-ness” can be dropped here; viewing the normal distribution through the lens of either the principle of maximum entropy or sufficient statistics tells us that it is variance specifically which is relevant, and any spread-metric not isomorphic to variance will be a leaky abstraction. (Leaky meaning that it will not capture all the relevant information about the spread, whereas variance does capture all the information, in a formal sense: it’s a sufficient statistic.)
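To spell out the sufficiency claim for an i.i.d. normal sample (this is the standard textbook factorization, included here just for concreteness):

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2}{2\sigma^2}\right),$$

so the likelihood depends on the data only through (\sum_i x_i, \sum_i x_i^2), equivalently the sample mean and sample variance; once those are known, no other spread statistic carries additional information about σ².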
But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way)...
I don’t think this is right. Suppose I have a uniform distribution over a finite set of X-values. The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant, but the median would stay the same.
I think the measure which would produce median is E[1 − 2 I[X>m]], where I[.] is an indicator function?
The “or any other measure of spreadout-ness” can be dropped here
What I meant is that, if you restrict attention to normal distributions with a fixed mean, then any reasonable measure of how spread out a distribution is (including any of the E[|X-mean|^p]) will be a sufficient statistic, because any such measure, in order to be reasonable, must increase as variance increases (for normal distributions), so this function can be inverted to recover the variance. In other words, any other such measure will indeed be isomorphic to variance when restricted to normal distributions.
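To make that invertibility concrete for the absolute-moment family specifically (this is the standard formula for the absolute central moments of a normal distribution): if X ~ N(μ, σ²), then

$$E\big[|X-\mu|^p\big] = \sigma^p \cdot \frac{2^{p/2}\,\Gamma\!\big(\tfrac{p+1}{2}\big)}{\sqrt{\pi}},$$

which for any fixed p > 0 is strictly increasing in σ, so it can be inverted to recover σ² (and for p = 2 the constant is 1, giving back the variance itself).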
The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant
This does not change the minimizer of E[|X-m|] because it increases E[|X-m|] by the same amount for every m>min(X).
In general, you can’t decrease E[|X-m|] by moving m from median to median-d for d>0 because, for x≥median (half the distribution), you increase |X-m| by d, and for the other half, you decrease |X-m| by at most d.
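A quick numerical check of both points, using the kind of finite uniform distribution proposed above (the helper name, the specific values, and the brute-force grid search are just for illustration):

```python
import numpy as np

def minimizer_of_mean_abs_dev(values, grid):
    """Brute-force the grid point m minimizing E[|X - m|]
    for a uniform distribution over `values`."""
    costs = np.abs(values[None, :] - grid[:, None]).mean(axis=1)
    return grid[np.argmin(costs)]

grid = np.linspace(-150, 10, 100001)

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # median = 3
print(minimizer_of_mean_abs_dev(xs, grid))            # ~3.0

# Drag the minimum value way down, leaving everything else fixed:
xs_shifted = np.array([-100.0, 2.0, 3.0, 4.0, 5.0])  # median still 3
print(minimizer_of_mean_abs_dev(xs_shifted, grid))    # still ~3.0
```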
I don’t agree with the argument on the variance:

“Any other such measure will indeed be isomorphic to variance when restricted to normal distributions.”
It’s true, but you should not restrict to normal distributions in this context. It is possible to find distributions X1 and X2 with different variances but the same value of E[|X-mean|^p] for p≠2. Then X1 and X2 look the same to this p-variance, but their normalized sample averages converge to different normal distributions. Hence variance is indeed the right and only measure of spreadout-ness to consider when applying the central limit theorem.
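Here is one concrete pair of that kind, with a small simulation (these particular distributions are just one convenient choice): let X1 be ±1 with probability 1/2 each, and X2 be 0 with probability 1/2 and ±2 with probability 1/4 each. Both have mean 0 and E[|X-mean|] = 1, but Var(X1) = 1 while Var(X2) = 2, and the normalized sums separate accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 4_000   # summands per trial, number of trials

# X1: +/-1 with prob 1/2 each         -> mean 0, E|X| = 1, Var = 1
# X2: 0 w.p. 1/2, +/-2 w.p. 1/4 each  -> mean 0, E|X| = 1, Var = 2
x1 = rng.choice([-1.0, 1.0], size=(trials, n))
x2 = rng.choice([-2.0, 0.0, 0.0, 2.0], size=(trials, n))

# Both look identical to the p=1 spread measure...
print(np.abs(x1).mean(), np.abs(x2).mean())   # both ~1.0

# ...but the normalized sums S_n / sqrt(n) are approximately N(0, Var(X)),
# so their spread is governed by the variance alone.
s1 = x1.sum(axis=1) / np.sqrt(n)
s2 = x2.sum(axis=1) / np.sqrt(n)
print(s1.std(), s2.std())                     # ~1.0 vs ~1.41 (= sqrt(2))
```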
That’s exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.

Ah, these make sense. Thanks.
Maybe entropic uncertainty (conjectured by Everett as part of his “Many Worlds” thesis, and proved by Hirschmann and Beckner) is along the lines of what you’re looking for. It’s a generalization of the Heisenberg uncertainty principle that applies even when the variance isn’t well defined.