Estimates vs. head-to-head comparisons

(Cross-posted from my blog.)

Summary: when choosing between two options, it’s not always optimal to estimate the value of each option separately and then pick the one with the higher estimate.

Suppose I am choosing between two actions, X and Y. One way to make my decision is to predict what will happen if I do X and predict what will happen if I do Y, and then pick the option which leads to the outcome that I prefer.

My predictions may be both vague and error-prone, and my value judgments might be very difficult to make, or nearly arbitrary. But it seems like I ultimately must make some predictions, and must decide how valuable the different outcomes are. So if I have to choose among N options, I could do it by estimating the goodness of each option, and then simply picking the one with the highest estimated value. Right?

There are other possible procedures for deciding which of two options is better. For example, I have often encountered advice of the form “if your error bars are too big, you should just ignore the estimate”. At the extreme, I could choose some particular axis along which options can be better or worse, and then pick the option which is best on that axis, ignoring all others. (E.g., I could choose the option which is cheapest, or the charity which is most competently administered, or whatever.)

If you have an optimistic quantitative outlook like mine, this probably looks pretty silly: if one option is cheaper, that just gets figured into my estimate of how good it is. If my error bars are big, then as long as I keep track of them in my calculation, the estimate is still better than nothing. So why would I ever want to do anything other than estimate the value of each option?

In fact I don’t think my intuition is quite right. To see why, let’s start with a very simple case.

A simple model

Alice and Bob are picking between two interventions X and Y. They only have a year to make their decision, so they split up: Alice will produce an estimate of the value of X and Bob will produce an estimate of the value of Y, and they will both do whichever one looks better. Let’s suppose that Alice and Bob are perfectly calibrated and trust each other completely, so that each of them believes the other’s estimate to be unbiased.

Suppose that intervention X is good because it reduces carbon emissions. First Alice dutifully estimates the reduction in emissions that results from intervention X; call that number A1. Of course Alice doesn’t care about carbon emissions per se: she cares about the improvements in human quality of life that result from decreased emissions, and she can’t compare her estimate with Bob’s unless she converts it into units of goodness. So she next estimates the gain in quality of life per unit of reduced emissions; call that number A2. She then reports that the value of X is A1 * A2. Because each of her estimates is unbiased, as long as her estimates of A1 and A2 are independent, their product A1 * A2 is an unbiased estimate of the value of X.
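To spell out the unbiasedness claim (the expectation notation E[·] is mine, not part of the original setup): independence is exactly what lets the expectation factor through the product,

E[A1 * A2] = E[A1] * E[A2] = (true emission reduction) * (true gain in quality of life per unit),

which is the true value of X. If the errors in A1 and A2 were correlated, E[A1 * A2] would differ from this by their covariance, and the product would be biased even though each factor is fine on its own.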

Meanwhile, it happens to be the case that intervention Y is also good because it reduces carbon emissions. So Bob similarly estimates the reduction in carbon emissions from intervention Y, B1, and then the goodness of reduced emissions, B2, and reports B1 * B2. His estimate is also an unbiased estimate of the value of Y.

The pair decides to do intervention X iff it appears to have the higher value, i.e. iff A1 * A2 > B1 * B2. This is not crazy, but it’s also not a very good idea. Since both interventions matter only through the emissions they prevent, intervention X is better than intervention Y iff it reduces emissions more, and the pair’s best guess about that is settled by whether A1 > B1. But A2 and B2 are noisy estimates of the very same conversion factor, so comparing the products injects extra, irrelevant noise; if that noise is large relative to the gap between A1 and B1, Alice and Bob will make an unnecessarily random decision.
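To see how much this can cost, here is a minimal simulation sketch. All of the numbers (true reductions, noise levels) are made up for illustration; the point is just that when the shared conversion factor is estimated noisily and independently, comparing A1 * A2 with B1 * B2 identifies the better intervention much less reliably than comparing A1 with B1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# Hypothetical numbers, chosen only to illustrate the point.
true_x, true_y = 105.0, 100.0  # true emission reductions; X really is better
true_v = 2.0                   # true gain in quality of life per unit of reduced emissions

# Unbiased but noisy estimates.
a1 = true_x + rng.normal(0, 5.0, n_trials)  # Alice's estimate of X's emission reduction
b1 = true_y + rng.normal(0, 5.0, n_trials)  # Bob's estimate of Y's emission reduction
a2 = true_v + rng.normal(0, 1.0, n_trials)  # Alice's estimate of the conversion factor
b2 = true_v + rng.normal(0, 1.0, n_trials)  # Bob's independent estimate of the same factor

# How often does each decision rule pick the genuinely better intervention (X)?
correct_by_value = ((a1 * a2) > (b1 * b2)).mean()   # compare full value estimates
correct_by_emissions = (a1 > b1).mean()             # compare emission estimates only

print(f"comparing A1*A2 vs B1*B2 picks the better option: {correct_by_value:.1%}")
print(f"comparing A1 vs B1 picks the better option:       {correct_by_emissions:.1%}")
```

With these made-up numbers, the emissions-only comparison picks the better intervention about three quarters of the time, while the product comparison does only slightly better than a coin flip, even though both value estimates are unbiased.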

What went wrong? Alice and Bob aren’t making a systematically bad decision, but they could have made a better one by using a different technique for comparison. I think a similar situation arises very often, in situations that are much less simple and where the problem is somewhat less severe. This means that the best way to compare X and Y is not always to compute a value for each. When making a comparison between X and Y, we can often reduce uncertainty by making the analysis of X as similar to the analysis of Y as possible, so that shared errors cancel out of the comparison.

Objections

Of course this example was very simple, and there are lots of reasons you might expect more realistic estimates to be safe from these problems. I think that, despite all of these differences, the simple model captures a common failure in estimation. The basic point is that, as the example shows, there is no general reason to expect comparisons of independently produced value estimates to yield optimal decisions. Without a general reason to think the procedure is optimal, it is on much shakier ground than it might appear. But to make the point, here are responses to some of the most obvious objections:

1. The reason we can say that Alice and Bob did badly is because we know something they didn’t—that A2 and B2 were estimates of the same quantity. Couldn’t they just have done one extra step of work—updating each of their estimates after looking at the other’s work—and avoided the problem?

In this case, that would have solved Alice and Bob’s problem. But in practice, different estimates rarely involve estimating exactly the same intermediates. If I want to compare the goodness of health interventions and education interventions in the developing world, the most natural estimates might not have even a single step in common. Nevertheless, each of those estimates would involve many uncertainties about social dynamics in the developing world, long-term global outcomes, and so on. I could do my analysis in a way that introduced analogies between the two estimates, and this could let some of that shared uncertainty cancel out of the comparison (even if the resulting estimates were noisier, or involved ignoring some apparently useful information).

If Alice and Bob’s estimates were related in a more complicated way, then it’s still the case that there is some extra update Alice and Bob could have done, which would have eliminated the problem (i.e. updating on each other’s estimates, using that relationship). But such an update could be quite complicated, and after making it Alice and Bob would need to make further updates still. In general, it’s not clear I can fix the problem without being logically omniscient. I don’t know the extent of this issue in practice, and I’m not familiar with a literature on this or related problems. It seems pretty messy in general, but I expect it would be possible to make meaningful headway on it.

The point is: in order to prove that comparing independent value estimates is optimal, it is not enough to assume that my beliefs are well-calibrated. I also need to assume that my beliefs make use of all available information (including having considered every alternative estimation strategy that sheds light on the question), which is unrealistic even for an idealized agent unless it is logically omniscient. When my beliefs don’t make use of all available information, other techniques for comparison might do better, including using different estimates which have more elements in common. (In some cases, even very simple approaches like “do the cheapest thing” will be predictably better than comparing independent value estimates.)

2. Alice and Bob had trouble because they are two different people. I agree that I shouldn’t compare estimates from different people, but if I do all of the estimates myself it seems like this isn’t a problem.

When I try to estimate the same thing several times, without remembering my earlier estimates, I tend to get different results. I strongly suspect this is universal, though I haven’t seen research on that question.

Moreover, when I try to estimate different things, my estimates tend not to obey the logical relationships that I know the estimated quantities must satisfy, unless I go back through with those particular relationships in mind and enforce them. For example, if I estimate A and B separately, the sum is rarely the same as if I had estimated A+B directly. When the relationships amongst items are complicated, such consistency is unrealistically difficult to enforce. (Of course, the prospects for making comparisons also suffer.) It may be that there is some principled way to get around these problems, but I don’t know it.

Alice’s and Bob’s estimates of the shared quantity don’t have to diverge very much before a different comparison procedure would have done better. I agree that estimates from a single person will be more consistent than estimates from different people, but they won’t be consistent enough to remove the problem (or the opportunity for improvement, if you want to look at it from a different angle).

3. The weird behavior in the example came from the artificial structure of the problem. How often can you factor out shared terms like this in realistic estimates, even when the estimates are similar?

If I’m trying to estimate the effect of different health interventions, the first step would be to separate the question “How much does this improve people’s health?” from “How much does improving people’s health matter?” That already factors out a big piece of the uncertainty. I think most people get that far, though, and so the question is: can you go farther?

I think it is still easier to estimate “Which of these interventions improves health more?” than to estimate the absolute improvement from either. We can break this comparison down into still smaller comparisons: “How many more or fewer people does X reach than Y?”, “Per person affected, what is the relative impact of X and Y?”, and so on. By focusing on the most important comparisons, and writing the others off as a wash, we might be able to reduce the total error in our comparison.
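One way to see the structure of such a decomposition (the symbols here are mine, introduced only for illustration): write the value of each intervention as (people reached) * (impact per person affected). Then

(value of X) / (value of Y) = (reach of X / reach of Y) * (per-person impact of X / per-person impact of Y).

Uncertainties shared between X and Y, such as how much improving health matters at all, cancel out of these ratios. And if one ratio can be confidently judged to be close to 1, it can be written off as a wash, so that the estimation effort concentrates on the comparison that actually drives the answer.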

Conclusion

Trying to explicitly estimate the goodness of outcomes tends to draw a lot of criticism from pretty much every side. I think most of this criticism is unjustified (and often rooted in an aversion to making reasoning or motivations explicit, a desire to avoid offense or culpability, etc.). Nevertheless, there are problems with many straightforward approaches to quantitative estimation, and some qualitative processes improve on quantitative estimation in important ways. These improvements are often dismissed by optimistic quantitative types (myself included), and I think that is an error. For example, I mentioned that I’ve often dismissed arguments of the form “If your error bars are too big, you are sometimes better off ignoring the data.” This looks obviously wrong on the Bayesian account, but as far as I can tell it may actually be the optimal behavior, even for idealized, bias-free humans.