The plots below are an example: if all you had was the median, you wouldn’t know these three datasets were different; but if all you had was the mean, you would know.
Take those three datasets, and shift them by (-mean) to generate 3 new datasets. Then suddenly the median is distinct on all 3 new datasets, but the mean the same.
It seems for any set of two datasets with identical (statistical parameter X) and different (statistical parameter Y), you can come up with a different set of two datasets with different (statistical parameter X) and identical (statistical parameter Y)[1], which seems rather symmetric?
There are differences between the median and mean; this does not appear to be a sound justification.
(Now, moving one datapoint on the other hand...)
[1] At least ones with a symmetry (in this case translation symmetry).
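A minimal sketch of that construction in Python (the three datasets here are made up for illustration): the datasets share a median but have different means, and shifting each by its own (-mean) makes the means identical while the medians come apart.

```python
import numpy as np

# Three hypothetical datasets with the same median (3) but different means.
datasets = {
    "a": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),    # mean 3.0
    "b": np.array([1.0, 2.0, 3.0, 4.0, 100.0]),  # mean 22.0
    "c": np.array([-50.0, 2.0, 3.0, 4.0, 5.0]),  # mean -7.2
}

for name, xs in datasets.items():
    print(f"{name}: median={np.median(xs):7.1f}  mean={np.mean(xs):7.1f}")

print("after shifting each dataset by (-mean):")
for name, xs in datasets.items():
    shifted = xs - np.mean(xs)
    # The means are now all 0 by construction; the medians are now distinct.
    print(f"{name}: median={np.median(shifted):7.1f}  mean={np.mean(shifted):7.1f}")
```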
It’s a good observation that I hadn’t considered—thanks for sharing it. After thinking it over, I don’t think it’s important.
I don’t know quite how to say this succinctly, so I’ll say it in a rambling way: subtracting the mean is an operation precisely targeted to make the means the same. Of course mean-standardized distributions have the same means! In the family of all distributions, there will be a lot more with the same median than with the same mean (I can’t prove this properly, but it feels true). You’re right that, in the family of mean-standardized distributions, this isn’t true, but that’s a family of distributions specifically constructed to make it untrue. The behavior in that specific family isn’t very informative about the information content of these location parameters in general.
Fair, but on the flip side, moving all the data below the median by a noisy but always-negative amount is an operation precisely targeted to make the medians the same.
(To be perfectly clear: there are clear differences between the median and mean in terms of robustness; I just don’t think this particular example thereof is a good one.)
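For concreteness, a small sketch of that median-preserving operation (made-up data, Python/numpy): every point below the median gets pushed down by a random amount, the median stays put, and the mean moves.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data; median = 3
median_before = np.median(xs)

# Push every point strictly below the median down by a random positive amount.
# Those points stay below the median, so the median element is untouched.
perturbed = xs.copy()
below = perturbed < median_before
perturbed[below] -= rng.uniform(0.1, 10.0, size=int(below.sum()))

print("median:", np.median(xs), "->", np.median(perturbed))  # unchanged
print("mean:  ", np.mean(xs), "->", np.mean(perturbed))      # changed
```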
I’d argue that moving the data like that is not as precise: “choose any data point right of the median, and add any amount to it” is a larger set of operations than “subtract the mean from the distribution”.
(Although: is there a larger class of operations than just subtracting the mean that result in identical means but different medians? If there were, that would damage my conception of robustness here, but I haven’t tried to think of how to find such a class of operations, if they exist.)
Subtracting the mean and then scaling the resulting distribution by any nonzero constant works.
Alternatively, if you have a distribution and want to turn it into a different distribution with the same mean but different median:
You can move two data points, one by X, the other by -X, so long as this results in a non-zero net number of crossings over the former median.
This is guaranteed to be the case with either any sufficiently large positive X or any sufficiently large negative X (at least when the two moved points start on the same side of the former median).
This is admittedly a one-dimensional subset of a 2-dimensional random space.
You can move three data points, one by X, one by Y, the last by -(X+Y), so long as this results in a non-zero net number of crossings over the former median.
This is guaranteed to be the case for large enough (absolute) values. (Unlike in the two-point case, this always works for large X and/or Y, regardless of sign.)
This is admittedly a 2-dimensional subset of a 3-dimensional random space.
etc.
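A quick numerical check of both of these, with made-up numbers (Python/numpy): subtracting the mean and rescaling keeps the mean at 0 while the median moves with the scale factor, and an X/-X move of two points that produces a net crossing leaves the mean alone but shifts the median.

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data; mean 22, median 3

# Subtract the mean, then scale by any nonzero constant:
# the mean stays 0, but the median depends on the scale factor.
for c in (1.0, 2.0, -0.5):
    ys = c * (xs - xs.mean())
    print(f"scale {c:5.1f}: mean={ys.mean():.1f}  median={np.median(ys):.1f}")

# Move two points by +X and -X, with X large enough that exactly one of them
# crosses the former median: the mean is unchanged, the median is not.
X = 50.0
zs = xs.copy()
zs[0] += X  # 1 -> 51: crosses the former median (3) going up
zs[4] -= X  # 100 -> 50: stays above the former median
print(f"two-point move: mean={zs.mean():.1f}  median={np.median(zs):.1f}")
```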
Aha. Now you are getting closer to the typical notion of robustness!
Imagine taking a sample of N elements (...N odd, to keep things simple) from a distribution and applying a random infinitesimal perturbation, say one chosen uniformly from {−ϵ, 0, ϵ} for each element of the sample.
In order for the median to stay the same, the median element must not change[1]. So we have a probability of 1/3rd that this doesn’t change the median. This scales as O(1).
In order for the mean to stay the same, the resulting perturbation must have a mean of 0 (and hence a sum of zero). How likely is this? Well, this is just a lazy random walk. The resulting probability (in the large-N limit) is just[2]
P[X_N = 0] ≈ √3 / (2√(πN)) [3]
This scales as O(N^(-1/2)).
[1] Because this is an infinitesimal perturbation, the probability that this changes which element is the median is ~zero.
[2] https://math.stackexchange.com/a/1327363/246278 with n=3 and l=N
[3] A wild π appeared![4]
[4] I don’t know why I am so amused by π turning up in ‘random’ places.
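A Monte Carlo sketch of this argument (Python/numpy; working in units of ϵ, so each element is perturbed by a value drawn uniformly from {-1, 0, +1}), comparing the two probabilities against 1/3 and the √3/(2√(πN)) approximation quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

def unchanged_probs(N, trials=50_000):
    """Estimate P[median unchanged] and P[mean unchanged] under a perturbation
    drawn uniformly from {-1, 0, +1} per element (infinitesimal-epsilon limit,
    N odd, all data values distinct)."""
    steps = rng.integers(-1, 2, size=(trials, N), dtype=np.int8)
    # The median is unchanged iff the median element's perturbation is 0;
    # any fixed column stands in for the median element, since steps are i.i.d.
    p_median = np.mean(steps[:, N // 2] == 0)
    # The mean is unchanged iff the perturbations sum to 0 (lazy random walk at 0).
    p_mean = np.mean(steps.sum(axis=1, dtype=np.int64) == 0)
    return p_median, p_mean

for N in (11, 101, 1001):
    p_med, p_mean = unchanged_probs(N)
    approx = np.sqrt(3.0) / (2.0 * np.sqrt(np.pi * N))
    print(f"N={N:5d}  P[median same]~{p_med:.3f} (vs 1/3)"
          f"  P[mean same]~{p_mean:.4f} (vs {approx:.4f})")
```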
I like that notion of robustness! I’m having trouble understanding the big-O behavior here because of the 1/√N term—does the decreasing nature of this function as N goes up mean the mean becomes more robust than the median for large N, or does the median always win for any N?
Ah, to be clear:
There’s a 1/3rd chance that the median does not change under an infinitesimal perturbation as I’ve defined it.
There’s a Θ(N^(-1/2)) chance that the mean does not change under an infinitesimal perturbation as I’ve defined it.
Or, to flip it around:
There’s a 2/3rds chance that the median does change under an infinitesimal perturbation as I’ve defined it.
There’s a 1 − Θ(N^(-1/2)) chance that the mean does change under an infinitesimal perturbation as I’ve defined it.
As you increase the number of data points, the mean asymptotes towards ‘almost always’ changing under an infinitesimal perturbation, whereas the median stays at a 2/3rds[1] chance.
[1] Minor self-nit: this was assuming an odd number of data points. That being said, the probability assuming an even number of data points (and hence a median that is the mean of the center two elements) actually works out to the same: {−ϵ,−ϵ}, {−ϵ,0}, {0,−ϵ}, {0,ϵ}, {ϵ,0}, {ϵ,ϵ} all change the median, or 6 of 9 possibilities.
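A tiny enumeration of that even-N footnote (Python), counting which perturbation pairs on the two center elements move their average:

```python
from itertools import product

# Even-N case: the median is the average of the two center elements, so only
# their perturbations matter (infinitesimally, no other element can cross them).
# Working in units of epsilon, the average moves iff the two perturbations
# don't cancel.
pairs = list(product((-1, 0, 1), repeat=2))
changed = [p for p in pairs if sum(p) != 0]
print(f"{len(changed)} of {len(pairs)} pairs change the median: {changed}")
# -> 6 of 9, the same 2/3rds chance as in the odd-N case.
```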
Gotcha—thanks.