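The Error of Crowds
I’ve always been annoyed at the notion that the bias-variance decomposition tells us something about modesty or Philosophical Majoritarianism. For example, Scott Page rearranges the equation to get what he calls the Diversity Prediction Theorem:
Collective Error = Average Individual Error - Prediction Diversity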
I think I’ve finally come up with a nice, mathematical way to drive a stake through the heart of that concept and bury it beneath a crossroads at midnight, though I fully expect that it shall someday rise again and shamble forth to eat the brains of the living.
Why should the bias-variance decomposition be relevant to modesty? Because it seems to show that the error of averaging all the estimates together is lower than the typical error of an individual estimate. Prediction Diversity (the variance) is positive whenever any disagreement exists at all, so Collective Error < Average Individual Error. But then how can you justify keeping your own estimate, unless you know that you did better than average? And how can you legitimately trust that belief, when studies show that everyone believes themselves to be above-average? You should be more modest, and compromise a little.
So what’s wrong with this picture?
To begin with, the bias-variance decomposition is a mathematical tautology. It applies when we ask a group of experts to estimate the 2007 close of the NASDAQ index. It would also apply if you weighed the experts on a pound scale and treated the results as estimates of the dollar cost of oil in 2020.
As Einstein put it, “Insofar as the expressions of mathematics refer to reality they are not certain, and insofar as they are certain they do not refer to reality.” The real modesty argument, Aumann’s Agreement Theorem, has preconditions; AAT depends on agents computing their beliefs in a particular way. AAT’s conclusions can be false in any particular case, if the agents don’t reason as Bayesians.
The bias-variance decomposition applies to the luminosity of fireflies treated as estimates, just as much as a group of expert opinions. This tells you that you are not dealing with a causal description of how the world works—there are not necessarily any causal quantities, things-in-the-world, that correspond to “collective error” or “prediction diversity”. The bias-variance decomposition is not about modesty, communication, sharing of evidence, tolerating different opinions, humbling yourself, overconfidence, or group compromise. It’s an algebraic tautology that holds whenever its quantities are defined consistently, even if they refer to the silicon content of pebbles.
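The tautology is easy to check numerically. The Python sketch below is only an illustration: the function name `decomposition` and the sample numbers are hypothetical, chosen here and not taken from Page. It treats an arbitrary list of numbers as "estimates" of an arbitrary "true value" and confirms that the three quantities satisfy the identity.

```python
from statistics import fmean

def decomposition(estimates, truth):
    """Return (collective_error, average_individual_error, prediction_diversity).

    All three are defined with squared differences, as the identity requires.
    """
    mean_estimate = fmean(estimates)
    collective_error = (mean_estimate - truth) ** 2
    average_individual_error = fmean([(e - truth) ** 2 for e in estimates])
    prediction_diversity = fmean([(e - mean_estimate) ** 2 for e in estimates])
    return collective_error, average_individual_error, prediction_diversity

# Hypothetical data: weights of "experts" in pounds, scored against an
# unrelated "true value". The identity does not care what the numbers mean.
weights = [150.0, 182.5, 201.0, 134.0]
collective, individual, diversity = decomposition(weights, truth=61.0)

# The identity holds up to floating-point rounding.
print(abs(collective - (individual - diversity)) < 1e-9)
```

Swapping in any other list of numbers, meaningful or not, leaves the check passing, which is the point: nothing about the data had to be an honest estimate.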
More importantly, the tautology depends on a particular definition of “error”: error must go as the squared difference between the estimate and the true value. By picking a different error function, just as plausible as the squared difference, you can conjure a diametrically opposed recommendation:
The professor cleared his throat. “All right,” he said to the gathered students, “you’ve each handed in your written estimates of the value of this expression here,” and he gestured to a rather complex-looking string of symbols drawn on the blackboard. “Now it so happens,” the professor continued, “that this question contains a hidden gotcha. All of you missed in the same direction—that is, you all underestimated or all overestimated the true value, but I won’t tell you which. Now, I’m going to take the square root of the amount by which you missed the correct answer, and subtract it from your grade on today’s homework. But before I do that, I’m going to give you a chance to revise your answers. You can talk with each other and share your thoughts about the problem, if you like; or alternatively, you could stick your fingers in your ears and hum. Which do you think is wiser?”
Here we are taking the square root of the difference between the true value and the estimate, and calling this the error function, or loss function. (It goes without saying that a student’s utility is linear in their grade.)
And now, your expected utility is higher if you pick a random student’s estimate than if you pick the average of the class! The students would do worse, on average, by averaging their estimates together! And this again is tautologously true, by Jensen’s Inequality.
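To make the claim concrete, here is a small Python sketch under the professor's conditions. The numbers and the names `sqrt_loss`, `truth`, and `estimates` are made up for illustration; the only assumption that matters is that every estimate misses in the same direction.

```python
from math import sqrt
from statistics import fmean

def sqrt_loss(estimate, truth):
    """Penalty under the professor's rule: square root of the size of the miss."""
    return sqrt(abs(estimate - truth))

truth = 100.0
estimates = [84.0, 91.0, 99.0, 64.0]  # hypothetical; all below the true value

# Expected penalty if you adopt a randomly chosen student's estimate.
random_student_penalty = fmean([sqrt_loss(e, truth) for e in estimates])

# Penalty if everyone adopts the class average instead.
class_average_penalty = sqrt_loss(fmean(estimates), truth)

print(random_student_penalty < class_average_penalty)
```

With these numbers the individual misses are 16, 9, 1, and 36, so the average penalty is 3.5, while the class average misses by 15.5 and is penalized by roughly 3.94.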
A brief explanation of Jensen’s Inequality:
(I strongly recommend looking at this graph while reading the following.)
Jensen’s Inequality says that if X is a probabilistic variable, F(X) is a function of X, and E[expr] stands for the probabilistic expectation of expr, then:
E[F(X)] <= F(E[X]) if F is concave (second derivative negative)
E[F(X)] >= F(E[X]) if F is convex (second derivative positive)
Why? Well, think of two values, x1 and x2. Suppose F is convex—the second derivative is positive, “the cup holds water”. Now imagine that we draw a line between x=x1, y=F(x1) and x=x2, y=F(x2). Pick a point halfway along this line. At the halfway point, x will equal (x1 + x2)/2, and y will equal (F(x1)+F(x2))/2. Now draw a vertical line from this halfway point to the curve—the intersection will be at x=(x1 + x2)/2, y=F((x1 + x2)/2). Since the cup holds water, the chord between two points on the curve is above the curve, and we draw the vertical line downward to intersect the curve. Thus F((x1 + x2)/2) < (F(x1) + F(x2))/2. In other words, the F of the average is less than the average of the Fs.
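The chord argument can be checked for specific functions. This sketch is illustrative only: `midpoint_gap` is a hypothetical helper, and the sample points are arbitrary. It computes the average of the Fs minus the F of the average, which should come out positive for a convex F and negative for a concave one.

```python
from math import sqrt

def midpoint_gap(f, x1, x2):
    """Average of f at the endpoints minus f at the midpoint.

    Positive when the chord lies above the curve (convex),
    negative when the chord lies below the curve (concave).
    """
    return (f(x1) + f(x2)) / 2 - f((x1 + x2) / 2)

def square(x):
    return x * x

print(midpoint_gap(square, 1.0, 9.0) > 0)  # convex: chord above the curve
print(midpoint_gap(sqrt, 1.0, 9.0) < 0)    # concave: chord below the curve
```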
So:
If you define the error as the squared difference, F(x) = x^2 is a convex function, with positive second derivative, and by Jensen’s Inequality, the error of the average, F(E[X]), is less than the average of the errors, E[F(X)]. So, amazingly enough, if you square the differences, the students can do better on average by averaging their estimates. What a surprise.
But in the example above, I defined the error as the square root of the difference, which is a concave function with a negative second derivative. Poof, by Jensen’s Inequality, the average error became less than the error of the average. (Actually, I also needed the professor to tell the students that they all erred in the same direction—otherwise, there would be a cusp at zero, and the curve would hold water. The real-world equivalent of this condition is that you think the directional or collective bias is a larger component of the error than individual variance.)
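The parenthetical about the cusp can be illustrated the same way. In this sketch the numbers are hypothetical and the helper `penalties` is invented here. It applies the same square-root penalty to two classes: one whose members all miss low, and one whose members miss on both sides of the true value.

```python
from math import sqrt
from statistics import fmean

def penalties(estimates, truth):
    """Return (mean penalty of individuals, penalty of the class average)
    under the square-root-of-the-miss rule."""
    individual = fmean([sqrt(abs(e - truth)) for e in estimates])
    averaged = sqrt(abs(fmean(estimates) - truth))
    return individual, averaged

truth = 50.0
same_side = [46.0, 41.0, 34.0]   # all too low: shared bias dominates
both_sides = [46.0, 54.0, 49.0]  # scattered around the truth: variance dominates

ind_same, avg_same = penalties(same_side, truth)
print(ind_same < avg_same)   # averaging hurts when everyone errs the same way

ind_both, avg_both = penalties(both_sides, truth)
print(ind_both > avg_both)   # averaging helps once the errors straddle the truth
```

Once the misses can have either sign, the penalty as a function of the signed miss is no longer concave everywhere, so Jensen's Inequality no longer forces the result against averaging.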
If, in the above dilemma, you think the students would still be wise to share their thoughts with each other, and talk over the math puzzle—I certainly think so—then your belief in the usefulness of conversation has nothing to do with a tautology defined over an error function that happens, in the case of squared error, to be convex. And it follows that you must think the process of sharing thoughts, of arguing differences, is not like averaging your opinions together; or that sticking to your opinion is not like being a random member of the group. Otherwise, you would stuff your fingers in your ears and hum when the problem had a concave error function.
When a line of reasoning starts assigning negative expected utilities to knowledge—offers to pay to avoid true information—I usually consider that a reductio.