[Explanation of the math confusion] To solve the problem, think locally. At each possible number, you can move up or down and see whether this increases or decreases the total cost. For example, if you’re at 3 (which has a cost of 3), moving up to 4 gives a cost of 6, and moving down to 2 gives a cost of 2.
One way to think about the cost of 4 relative to 3 is that when you move up to 4 you move one step away from each of the three data points, so you increase the cost by three. You can think of moving from 3 to 2 as moving toward two data points and away from one, which is a net benefit.
Overall, the goal is to find a position where moving in either direction takes you away from more points than it takes you toward. That will always be the central data point: with n data points on either side of it, moving in either direction takes you toward n of them but away from n+1 (the n on the other side, plus the point you were standing on). And yes, if you have an even number of data points, then every point between the two central data points is a median.
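If it helps to see this numerically, here's a minimal brute-force sketch; the data set [1, 2, 3] is my assumption, chosen so the costs match the figures quoted above (cost 3 at 3, cost 6 at 4, cost 2 at 2).

```python
# Minimal brute-force check of the "think locally" argument: compute the
# total absolute-distance cost at each candidate position and watch it
# fall until we reach the median, then rise again.
# (The data set [1, 2, 3] is an assumption chosen to match the costs above.)

data = [1, 2, 3]

def cost(m, points):
    """Total distance from candidate position m to every data point."""
    return sum(abs(x - m) for x in points)

for m in range(0, 6):
    print(m, cost(m, data))
# Costs are 6, 3, 2, 3, 6, 9 -- minimized at m = 2, the median, where a step
# in either direction moves away from more points than it moves toward.
```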
Thanks to Buck Shlegeris for teaching me this.
It’s possibly worth fitting this into a broader framework. The median minimizes the sum of |x−m|. (So it’s the max-likelihood estimator if your tails are like exp(−|t|).) The mean minimizes the sum of |x−m|^2. (So it’s the max-likelihood estimator if your tails are like exp(−t^2), a normal distribution.)
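To spell the max-likelihood connection out in symbols (my notation, not from the comment): if each observation has density proportional to exp(−|t−m|^p), the normalizing constant doesn't depend on m, so maximizing the likelihood is the same as minimizing the sum of p-th powers of the distances.

```latex
% Max-likelihood link (my notation): density proportional to e^{-|t-m|^p}.
\hat{m}
  = \arg\max_m \prod_{i=1}^{n} c_p \, e^{-|x_i - m|^p}
  = \arg\max_m \Big( n \log c_p - \sum_{i=1}^{n} |x_i - m|^p \Big)
  = \arg\min_m \sum_{i=1}^{n} |x_i - m|^p .
% p = 1 is the Laplace distribution (median); p = 2 is the normal (mean).
```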
What about other exponents? 1 and 2 are kinda standard cases; what about 0 or infinity? Or negative numbers? Let’s consider 0 first of all. |x−m|^0 is zero if m=x and 1 otherwise. Minimizing the sum of these means maximizing the number of things equal to m. This is the mode! (We’ll continue to get the mode if we use negative exponents. In that case we’d better maximize the sum instead of minimizing it, of course.) As p increases without limit, minimizing the sum of |x−m|^p gets closer and closer to minimizing max(|x−m|). If the 0-mean is the mode, the 1-mean is the median and the 2-mean is the ordinary mean, then the infinity-mean is midway between the max and min of your data. This one doesn’t get quite so much attention in stats class :-).
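Here's a rough numerical sketch of that family (the data set, grid resolution, and the grid-search approach are all my own illustrative choices): minimize the sum of |x−m|^p over a grid of candidate m's and see which classic statistic each p recovers.

```python
# Rough sketch of the p-mean family: grid-search the m that minimizes
# sum(|x - m|**p) and compare against the familiar statistics.
# (Data set and grid resolution are arbitrary illustrative choices.)
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 7.0])
grid = np.linspace(0.0, 8.0, 8001)

def p_mean(points, p, grid):
    """Approximate argmin over the grid of sum |x - m|**p."""
    costs = np.abs(points[:, None] - grid[None, :]) ** p
    return grid[costs.sum(axis=0).argmin()]

print(p_mean(data, 1, grid))    # ~2.0 -> the median
print(p_mean(data, 2, grid))    # ~3.0 -> the mean
print(p_mean(data, 50, grid))   # ~4.0 -> approaching the midrange, (1 + 7)/2
# p -> 0 is numerically delicate: |x - m|**p tends to 1 except exactly at the
# data points, so a grid search won't cleanly recover the mode; the replies
# below discuss the p -> 0 and negative-p limits more carefully.
```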
The median is famously more robust than the mean: it’s affected less by outliers. (This goes along with the fatter tails it assumes: if you assume a very thin-tailed distribution, then an outlying point is super-unlikely and you’re going to have to try very hard to make it less outlying.) The mode is more robust still, in that sense. The “infinity-mean” (note: these are my names, and so far as I know no one else uses them) is kinda the least robust average you can imagine, being affected *only* by the most outlying data points.
The standard name for the “infinity-mean” is the midrange.
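A quick illustration of that robustness ordering (the numbers are arbitrary; I'm using Python's statistics module plus a hand-rolled midrange):

```python
# Nudge one data point far out and see how much each summary statistic moves.
# (The data values are arbitrary; "midrange" is the standard name for the
# "infinity-mean", as noted above.)
import statistics

clean   = [1, 2, 2, 3, 4]
spoiled = [1, 2, 2, 3, 400]   # same data, but with one extreme outlier

summaries = [("mode", statistics.mode),
             ("median", statistics.median),
             ("mean", statistics.mean),
             ("midrange", lambda xs: (min(xs) + max(xs)) / 2)]

for name, stat in summaries:
    print(f"{name:>8}: {stat(clean)} -> {stat(spoiled)}")
# The mode and median don't move at all, the mean moves a lot,
# and the midrange moves the most.
```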
Yeah, thanks for this comment, I sorta skipped it because I didn’t want to write too much… or something. In retrospect I’m not sure I modelled curious readers well enough; I should’ve just left it in.
One thing I noticed that I’m not so sure about: a motivation you might have for |x−m|^2 over |x−m|^1 (i.e. mean over median) is that you want a summary statistic that always changes when the data points do. As you move from 1,2,3 to 1,2,4, the median doesn’t change but the mean does.
And yet, given that the minimizer of the sum of |x−m|^p approaches the midpoint of the max and min as p rises, it’s curious to see that we’ve chosen p=2. We wanted a summary statistic that changes whenever the data does, but, of all the ones that do, one that changes the least with the data. We could’ve settled on any integer greater than 1, and we picked 2, the smallest.
From a purely mathematical point of view I don’t see why the exponent should be an integer. But p=2 is preferred over all other real values because of the Central Limit Theorem.
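A trivial check of the 1,2,3 versus 1,2,4 example above:

```python
# Perturbing the top data point moves the mean but leaves the median alone.
import statistics

print(statistics.median([1, 2, 3]), statistics.mean([1, 2, 3]))  # median 2, mean 2
print(statistics.median([1, 2, 4]), statistics.mean([1, 2, 4]))  # median 2, mean ~2.33
```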
A longer explanation with pictures can be found here—Mean, median, mode, a unifying perspective
I don’t think maximizing the sum of the negative exponents gets you the mode. If you use 0^−p=0 then the supremum (infinity) is not attained, while if you use 0^−p=∞ then the maximum (infinity) is attained at any data point. If you do it with a continuous distribution you get more sensible answers but the solution (which is intuitively the “point of greatest concentration”) is not necessarily unique.
It’s worth mentioning that when p>1 the p-mean is unique: this is because m↦|x−m|^p is a strictly convex function for p>1, the sum of strictly convex functions is strictly convex, and a strictly convex function has at most one minimum.
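A short sketch of that uniqueness argument in symbols (my formulation):

```latex
F(m) \;=\; \sum_{i=1}^{n} |x_i - m|^p , \qquad p > 1 .
% t \mapsto |t|^p is strictly convex for p > 1, and sums of strictly convex
% functions are strictly convex, so F is strictly convex. If a \neq b were
% both minimizers, then
F\!\left(\tfrac{a+b}{2}\right) \;<\; \tfrac{1}{2}F(a) + \tfrac{1}{2}F(b) \;=\; \min_m F(m),
% a contradiction; hence the minimizer is unique. For p = 1 the summands are
% convex but not strictly so, which is why the median can be a whole interval.
```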
I’m using 0^−p=∞ and using the cheaty convention that e.g. 3⋅∞>2⋅∞. I think this is what you get if you regard a discrete distribution as a limit of continuous ones. If this is too cheaty, of course it’s fine just to stick with non-negative values of p.
Yeah, OK. It works but you need to make sure to take the limit in a particular way, e.g. convolution with a sequence of approximations to the identity. Also you need to assume that p>−1 since otherwise the statistic diverges even for the continuous distributions.
A geometric intuition I came up with while reading:
Take a number line, and put 1, 2, and 4 on it.
-1-2---4-
You’re moving a pointer along this line, and trying to minimize its total distance to the data points:
-1-2---4-
....^
Intuitively, throwing it somewhere near the middle of the line makes sense. But drop 2 out, and look what happens as you move it:
-1-----4-
....^
.|--|
....|--|
vs.
-1-----4-
......^
.|----|
......||
The distance is the same either way! (Specifically, it’s the same for any point in between 1 and 4.)
This means we’re free to optimize our pointer’s position with respect to 2 alone, so the best answer is to get a distance-from-2 of 0 by putting it directly on 2.
To generalize this to medians of larger data sets, imagine adding more pairs of points on the outside of the range—the total distance to those points will be the same, just as it was for 1 and 4.
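Here's a quick way to check the pairing intuition numerically (the function name and the values are just for illustration):

```python
# The total distance to an outer pair is constant anywhere between its two
# points, so only the inner point(s) can break the tie.

def total_distance(m, points):
    return sum(abs(x - m) for x in points)

# Distance to the outer pair {1, 4} is 3 everywhere between them:
print([total_distance(m, [1, 4]) for m in (1, 2, 2.5, 3, 4)])

# With the inner point 2 added back, the minimum sits exactly on 2:
print(min(range(1, 5), key=lambda m: total_distance(m, [1, 2, 4])))

# Adding another outer pair (say 0 and 9) adds the same constant to the cost
# at every position between them, so the minimizer, the median, stays at 2:
print(min(range(0, 10), key=lambda m: total_distance(m, [0, 1, 2, 4, 9])))
```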
[edit: formatting came out a bit ugly—monospace sections seem to eat multiple spaces when displayed but not in the editor for some reason?]