It’s possibly worth fitting this into a broader framework. The median minimizes the sum of |x−m|. (So it’s the max-likelihood estimator if your tails are like exp(−|t|).) The mean minimizes the sum of |x−m|². (So it’s the max-likelihood estimator if your tails are like exp(−t²), a normal distribution.)
What about other exponents? 1 and 2 are kinda standard cases; what about 0 or infinity? Or negative numbers? Let’s consider 0 first of all. |x−m|⁰ is zero if m=x and 1 otherwise. Minimizing the sum of these means maximizing the number of things equal to m. This is the mode! (We’ll continue to get the mode if we use negative exponents. In that case we’d better maximize the sum instead of minimizing it, of course.) As p increases without limit, minimizing the sum of |x−m|ᵖ gets closer and closer to minimizing max(|x−m|). If the 0-mean is the mode, the 1-mean is the median and the 2-mean is the ordinary mean, then the infinity-mean is midway between the max and min of your data. This one doesn’t get quite so much attention in stats class :-).
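To make this concrete, here’s a numerical sketch (the name `p_mean` and the example data are mine, not standard) that brute-forces the m minimizing Σ|x−m|ᵖ over a fine grid:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 7.0])

def p_mean(xs, p):
    """Brute-force the m minimizing sum(|x - m|**p) over a fine grid.

    A numerical sketch, not an efficient estimator.
    """
    grid = np.linspace(xs.min(), xs.max(), 100001)
    costs = (np.abs(xs[:, None] - grid[None, :]) ** p).sum(axis=0)
    return grid[costs.argmin()]

print(p_mean(data, 1))   # ~2: the median
print(p_mean(data, 2))   # ~3: the mean
print(p_mean(data, 50))  # ~4: approaching the midpoint of min and max
```

(p=0 is omitted: on a grid of candidate m, |x−m|⁰ is 1 almost everywhere, so you’d have to restrict m to the data points themselves to recover the mode.)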
The median is famously more robust than the mean: it’s affected less by outliers. (This goes along with the fatter tails it assumes: if you assume a very thin-tailed distribution, then an outlying point is super-unlikely and you’re going to have to try very hard to make it less outlying.) The mode is more robust still, in that sense. The “infinity-mean” (note: these are my names, and so far as I know no one else uses them) is kinda the least robust average you can imagine, being affected *only* by the most outlying data points.
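A quick illustration of that robustness ordering (made-up data; the midrange stands in for the “infinity-mean”): drag the largest point far out and see which averages notice.

```python
import statistics

base = [1, 2, 2, 3, 5]
shifted = [1, 2, 2, 3, 50]  # same data, but the top point dragged far out

midrange = lambda xs: (min(xs) + max(xs)) / 2

for name, f in [("mode", statistics.mode),
                ("median", statistics.median),
                ("mean", statistics.mean),
                ("midrange", midrange)]:
    print(name, f(base), f(shifted))
```

The mode and median don’t move at all; the mean moves a fair bit; the midrange jumps the most, tracking only the outlier.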
The standard name for the “infinity-mean” is the midrange.
Yeah, thanks for this comment, I sorta skipped it because I didn’t want to write too much… or something. In retrospect I’m not sure I modelled curious readers well enough, I should’ve just left it in.
One thing I noticed that I’m not so sure about: A motivation you might have for |x−m|² over |x−m|¹ (i.e. mean over median) is that you want a summary statistic that always changes when the data points do. As you move from 1,2,3 to 1,2,4, the median doesn’t change but the mean does.
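That example in two lines of the standard library:

```python
from statistics import mean, median

# The median is blind to the change; the mean picks it up.
print(median([1, 2, 3]), median([1, 2, 4]))  # both 2
print(mean([1, 2, 3]), mean([1, 2, 4]))      # these differ
```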
And yet, given that as p rises the minimizer of Σ|x−m|ᵖ approaches the midpoint of the max and min, it’s curious that we’ve chosen p=2. We wanted a summary statistic that changes whenever the data do, but of all the exponents with that property, we picked the one that changes least with the data: we could’ve settled on any integer greater than 1, and we picked 2.
From a purely mathematical point of view I don’t see why the exponent should be an integer. But p=2 is preferred over all other real values because of the Central Limit Theorem.
A longer explanation with pictures can be found here: “Mean, median, mode, a unifying perspective”.
I don’t think maximizing the sum of the negative exponents gets you the mode. If you use 0⁻ᵖ=0 then the supremum (infinity) is not attained, while if you use 0⁻ᵖ=∞ then the maximum (infinity) is attained at any data point. If you do it with a continuous distribution you get more sensible answers but the solution (which is intuitively the “point of greatest concentration”) is not necessarily unique.
It’s worth mentioning that when p>1 the p-mean is unique: this is because m↦|x−m|ᵖ is strictly convex, the sum of strictly convex functions is strictly convex, and a strictly convex function has at most one minimizer. (Convexity alone isn’t enough: |x−m|¹ is convex but the median can be non-unique.)
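The contrast is easy to see numerically (a sketch with made-up data): for p>1 the objective has a single minimum, while for p<1 the per-point term is no longer convex and the objective develops a sharp dip at every data point.

```python
import numpy as np

data = np.array([0.0, 1.0, 5.0])
grid = np.linspace(-1.0, 6.0, 7001)

def local_minima(p):
    """Count strict local minima of m -> sum(|x - m|**p) sampled on the grid."""
    obj = (np.abs(data[:, None] - grid[None, :]) ** p).sum(axis=0)
    interior = obj[1:-1]
    return int(((interior < obj[:-2]) & (interior < obj[2:])).sum())

print(local_minima(2.0))  # 1: strictly convex, one minimizer (the mean)
print(local_minima(0.5))  # 3: a local minimum at each of the three data points
```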
I’m using 0⁻ᵖ=∞ and using the cheaty convention that e.g. 3⋅∞>2⋅∞. I think this is what you get if you regard a discrete distribution as a limit of continuous ones. If this is too cheaty, of course it’s fine just to stick with non-negative values of p.
Yeah, OK. It works but you need to make sure to take the limit in a particular way, e.g. convolution with a sequence of approximations to the identity. Also you need to assume that p>−1 since otherwise the statistic diverges even for the continuous distributions.