Aggregating forecasts
[EDIT: I have written an update of this post here]
Epistemic status: confused. I seem to have spent some time thinking about this and still went with my gut, without having a great reason for it.
When you ask several people or build several models to forecast the probability of an event you are left with the question of how to aggregate them to harness their collective wisdom.
Arguably, the best possible thing would be to examine the divergence in opinions and try to update towards a shared view. But this is not always possible: sharing models is hard and time-consuming. How, then, can we aggregate forecasts naively?
I have been interested in this problem for a while, especially in the context of running forecasting workshops where my colleagues produce different estimates but we don't have enough time to discuss the differences.
To approach this problem I first lay out my intuition, then skim two papers approaching this question, and finally outline my best guess given what I learned.
I have a weak intuition that taking the geometric mean is the best simple way to aggregate probabilities. When I try to inquire into why I believe that, my explanation is something like:
1) we want the aggregation procedure to be simple (to not overfit), widely used (which is evidence of usefulness) and to have a good theoretical justification
2) arithmetic means are simple, widely used and are the maximum likelihood estimator of the expected value of a normal distribution, which we expect to be common-ish based on the central limit theorem
3) but summing probabilities is ontologically wrong / feels weird. p_1 + p_2 is not a probability.
4) a common approach in this kind of situation is to take the logarithm of your quantities of interest
5) the MLE of the location parameter of a log-normal (its median, $e^\mu$) is the geometric mean of the measurements
So in practice I mostly use geometric means to aggregate forecasts, and I don't feel too bad about it.
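As a minimal sketch of what that looks like in code (the function name and example numbers are mine, purely for illustration):

```python
import numpy as np

def geometric_mean(probabilities):
    """Aggregate binary-event forecasts by taking the geometric mean
    of the raw probabilities."""
    probabilities = np.asarray(probabilities, dtype=float)
    return float(np.exp(np.mean(np.log(probabilities))))

# Three hypothetical forecasters for the same event
print(geometric_mean([0.1, 0.2, 0.4]))  # 0.2 exactly, pulled towards the low forecasts
```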
There is still a part of me that feels like the explanation above is too handwavy, so I have been reading a bit of the literature on aggregating forecasts to try to understand the topic better.
Satopaa et al write about this issue, deriving an estimator of the best possible aggregate under some statistical assumptions and then testing it on synthetic and real data.
The estimator has the form

$$\hat{p}(a) = \frac{\left[\prod_{i=1}^N \left(\frac{p_i}{1-p_i}\right)\right]^{a/N}}{1 + \left[\prod_{i=1}^N \left(\frac{p_i}{1-p_i}\right)\right]^{a/N}}$$

where $p_i$ are the individual forecasts, $N$ is the number of forecasts and $a$ is a parameter indicating systematic bias in the individual forecasts.
Interestingly, they show that the statistical assumptions about the distribution of forecasts do not hold in their real dataset. But their estimator still outperforms, in terms of Brier score, other simple estimators such as the mean and the median, as well as some fancier estimators like the logarithmic opinion pool and a beta-transformed linear opinion pool.
This estimator takes the geometric mean of the odds instead of the raw probabilities, and extremizes them by raising the pooled odds to the power $a$. This parameter is fitted to the data at hand; they estimate that the value that minimizes the Brier score lies in $a \in [1.161, 3.921]$.
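Here is a rough sketch of that estimator in code, based on the description above; the function name, example forecasts and values of $a$ are my own, and $a = 1$ recovers the plain geometric mean of odds:

```python
import numpy as np

def extremized_geo_mean_odds(probabilities, a=1.0):
    """Geometric mean of odds, extremized by raising the pooled odds
    to the power a (a sketch of the Satopaa et al style estimator)."""
    p = np.asarray(probabilities, dtype=float)
    log_odds = np.log(p / (1 - p))
    pooled_log_odds = a * np.mean(log_odds)
    return float(1 / (1 + np.exp(-pooled_log_odds)))

forecasts = [0.6, 0.7, 0.9]
print(extremized_geo_mean_odds(forecasts, a=1.0))  # plain geometric mean of odds, ~0.76
print(extremized_geo_mean_odds(forecasts, a=2.0))  # extremized: pushed further from 0.5
```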
Sadly no empirical comparison with a geometric mean of probabilities is explored in the paper (if someone is interested in doing this, it would be a cool project to write up).
Digging a bit deeper I found this article by Allard et al surveying probability aggregation methods.
One of the cooler parts of their analysis is their assessment of the theoretical desiderata a forecast aggregator should satisfy.
Some other desiderata are discussed and argued against, but the three that drew my attention most are external Bayesianity, forcing and marginalization.
External Bayesianity is satisfied when the pooling operation commutes with Bayesian updating; that is, new information should lead to the same end result whether it is incorporated before or after the pooling.
The authors claim that it is a compelling property.
I find myself a bit confused about it. This property seems to talk about perfect Bayesians; is that too strong an idealization? Doesn't the Bayesian update depend on the information already available to the forecasters (which we are abstracting away in the aggregation exercise), making the property too restrictive for our purposes?
Interestingly, the authors claim that the class of pooling functions that satisfies external Bayesianity is the generalized weighted geometric means:

$$P(A) \propto \prod_{i=1}^N P_i(A)^{w_i}$$

where $\sum w_i = 1$.
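For a binary event, this kind of weighted geometric (log-linear) pooling might be sketched as follows; the equal-weights default and the example numbers are assumptions of mine, and the final step renormalizes so that the pooled probabilities of the event and its complement sum to 1:

```python
import numpy as np

def log_linear_pool(probabilities, weights=None):
    """Weighted geometric mean of the probabilities of the event and of
    its complement, renormalized so the two sum to 1."""
    p = np.asarray(probabilities, dtype=float)
    w = np.full(len(p), 1 / len(p)) if weights is None else np.asarray(weights, dtype=float)
    yes = np.prod(p ** w)
    no = np.prod((1 - p) ** w)
    return float(yes / (yes + no))

print(log_linear_pool([0.8, 0.99]))  # ~0.952, same as the geometric mean of odds
```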
Marginalization requires that the pooling operator commutes with marginalizing a joint probability distribution. This requires us to expand beyond the binary scenario to make sense.
The class of functions that satisfies marginalization is the generalized weighted arithmetic means:

$$P(A) = P_0 + \sum_{i=1}^N w_i\, P_i(A)$$

where $P_0$ is a constant.
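And a corresponding sketch of linear (arithmetic) pooling; I take equal weights and $P_0 = 0$ here, which recovers the plain arithmetic mean, since the constraints on the generalized version are spelled out in the paper rather than here:

```python
import numpy as np

def linear_pool(probabilities, weights=None, p0=0.0):
    """Weighted arithmetic mean of the forecasts plus an optional constant
    term P_0 (set to 0 here, recovering plain linear pooling)."""
    p = np.asarray(probabilities, dtype=float)
    w = np.full(len(p), 1 / len(p)) if weights is None else np.asarray(weights, dtype=float)
    return float(p0 + np.dot(w, p))

print(linear_pool([0.8, 0.99]))  # 0.895, the plain arithmetic mean
```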
The authors don't provide much commentary on how desirable marginalization is, although their writing suggests that they find external Bayesianity a more compelling property.
The authors also look into maximizing the KL entropy under some tame constraints and derive the corresponding maximizing pooling formula. More ground is covered in the paper, but I won't go over it here.
So, summarizing:
1) Satopaa et al find that the geometric mean of odds empirically beats some other aggregation methods.
2) Allard et al argue that a generalized geometric mean of probabilities satisfies some desirable desiderata (external Bayesianity), but they also study other desiderata that lead to other pooling functions.
All in all, it seems like there are some credible alternatives, but I am still confused.
There is some empirical evidence that linear aggregation of probabilities is outperformed by other methods. The theoretical case is not clear cut, since linear aggregation still preserves some properties that seem desirable like marginalization, but fails at other desiderata like external Bayesianity.
But if not linear aggregation, what should we use? The two candidates that stick out to me as credible within the realm of simplicity are geometric aggregation of probabilities and geometric aggregation of odds.
I don't have a good reason, empirical or theoretical, for preferring one over the other.
I would love to see a comparison of the geometric mean of probabilities and the geometric mean of odds, in the style of Satopaa et al, on either simulated or real data.
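As a very rough sketch of how such a comparison could be set up, here is a simulation with an entirely made-up data-generating process (so its output proves nothing about real forecasts); it only shows the shape of the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def geo_mean_probs(p):
    return np.exp(np.log(p).mean(axis=0))

def geo_mean_odds(p):
    log_odds = np.log(p / (1 - p)).mean(axis=0)
    return 1 / (1 + np.exp(-log_odds))

# Synthetic setup (invented): each event has a true probability, and each of
# 5 forecasters reports a noisy log-odds perturbation of it.
n_events, n_forecasters = 1000, 5
true_p = rng.uniform(0.05, 0.95, size=n_events)
noise = rng.normal(0, 1.0, size=(n_forecasters, n_events))
forecasts = 1 / (1 + np.exp(-(np.log(true_p / (1 - true_p)) + noise)))
outcomes = (rng.uniform(size=n_events) < true_p).astype(float)

for name, pooled in [("arithmetic mean", forecasts.mean(axis=0)),
                     ("geometric mean of probabilities", geo_mean_probs(forecasts)),
                     ("geometric mean of odds", geo_mean_odds(forecasts))]:
    brier = np.mean((pooled - outcomes) ** 2)
    print(f"{name}: Brier score {brier:.4f}")
```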
Ideally I would love an unambiguous answer motivating either of them or suggesting an alternative, but given the complexity of the papers I skimmed while writing this post I am moderately skeptical that this is going to happen anytime soon.
EDIT
I was talking about the geometric mean of odds and the geometric mean of probabilities as different things, but UnexpectedValues points out that after a (necessary) normalization they are one and the same.
So we can have the cake and eat it: Allard et al provide a theoretical reason why you should use the geometric mean of odds, and Satopaa et al provide an empirical reason to do so.
Now, the remaining questions are:
1) Are the theoretical reasons given by Allard et al compelling (mainly external Bayesianity)?
2) Are there any credible alternatives that beat geometric aggregation on Satopaa et al's dataset?
(Edit: I may have been misinterpreting what you meant by "geometric mean of probabilities." If you mean "take the geometric mean of the probabilities of all events and then scale them proportionally so they add to 1", then I think that's a pretty good method of aggregating probabilities. The point I make below is that the scaling is important.)
I think taking the geometric mean of odds makes more sense than taking the geometric mean of probabilities, because of an asymmetry arising from how the latter deals with probabilities near 0 versus probabilities near 1.
Concretely, suppose Alice forecasts an 80% chance of rain and Bob forecasts a 99% chance of rain. Those are 4:1 and 99:1 odds respectively, and if you take the geometric mean you’ll get an aggregate 95.2% chance of rain.
Equivalently, Alice and Bob are forecasting a 20% chance and a 1% chance of no rain—i.e. 1:4 and 1:99 odds. Taking the geometric mean of odds gives you a 4.8% chance of no rain—checks out.
Now suppose we instead take a geometric mean of probabilities. The geometric mean of 80% and 99% is roughly 89.0%, so aggregating Alice’s and Bob’s probabilities of rain in this way will give 89.0%.
On the other hand, aggregating Alice's and Bob's probabilities of no rain, i.e. taking the geometric mean of 20% and 1%, gives roughly 4.5%.
This means that there's an inconsistency with this method of aggregation: you get an 89% chance of rain and a 4.5% chance of no rain, which don't add up to 100%.
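A quick numerical check of this example, including the rescaling mentioned in the parenthetical edit above, which recovers the geometric-mean-of-odds answer:

```python
import math

alice, bob = 0.80, 0.99

# Geometric mean of odds: 4:1 and 99:1
pooled_odds = math.sqrt((alice / (1 - alice)) * (bob / (1 - bob)))
print(pooled_odds / (1 + pooled_odds))           # ~0.952 chance of rain

# Geometric mean of probabilities, taken separately for rain and no rain
gm_rain = math.sqrt(alice * bob)                 # ~0.890
gm_no_rain = math.sqrt((1 - alice) * (1 - bob))  # ~0.045 -- doesn't sum to 1 with the above
print(gm_rain, gm_no_rain, gm_rain + gm_no_rain)

# Rescaling the two to sum to 1 recovers the geometric mean of odds
print(gm_rain / (gm_rain + gm_no_rain))          # ~0.952 again
```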
Ohhhhhhhhhhhhhhhhhhhhhhhh
I had not realized, and this makes so much sense.
Geometric mean of the odds = mean of the evidences.
Suppose you have probabilities in odds form, $1 : 2^a$ and $1 : 2^b$, corresponding to $a$ and $b$ bits respectively. Then the geometric mean of the odds is $1 : \sqrt{2^a \cdot 2^b} = 1 : 2^{(a+b)/2}$, corresponding to $(a+b)/2$ bits: the midpoint of the evidences.
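The same identity, spelled out in code (treating base-2 log-odds as the bits of evidence; the example forecasts are arbitrary):

```python
import math

def bits(p):
    """Evidence against the event in bits: the odds on the event are 1 : 2**bits(p)."""
    return math.log2((1 - p) / p)

def from_bits(b):
    """Convert bits of evidence back into a probability."""
    return 1 / (1 + 2 ** b)

p1, p2 = 0.2, 0.04          # arbitrary example forecasts
a, b = bits(p1), bits(p2)   # 2.0 and ~4.58 bits

# Geometric mean of odds == midpoint of the evidences
geo_mean_odds = from_bits((a + b) / 2)
pooled_odds = math.sqrt((p1 / (1 - p1)) * (p2 / (1 - p2)))
print(geo_mean_odds, pooled_odds / (1 + pooled_odds))  # identical, ~0.093
```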
For some more background on why bits are the natural unit of probability, see for example this arbital article, or search Probability Theory: The Logic of Science. Bits are additive: you can just add or subtract bits as you encounter new evidence, and this is a pretty big "wink wink, nod nod, nudge nudge" as to why they'd be the natural unit.
In any case, if person A has seen a bits of evidence, of which a' are unique, and person B has seen b bits of evidence, of which b' are unique, and they have both seen s' bits of shared evidence (so a = a' + s' and b = b' + s'), then you'd want to add them, ending up at a' + b' + s', or a + b − s'. So maybe in practice (a+b)/2 = s' + (a'+b')/2 ~ a' + b' + s' when a' + b' is small (or overestimated, which imho seems to often be the case: people overestimate the importance of their own private information; there is also some literature on this).
This corresponds to the intuition that if someone is at 5%, and someone else is at 3% for totally unrelated reasons, the aggregate should be lower than either. And this would be a justification for Tetlock's extremizing.
Anyways, in practice, you might estimate s' as the evidence implied by the historical base rate (to which you and your forecasters both have access), and take a' and b' as each forecaster's deviation from that.
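A sketch of that recipe with invented numbers: treat each forecaster's log-odds deviation from a shared base rate as their private evidence, and add all the deviations back on top of the base rate.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def pool_vs_base_rate(forecasts, base_rate):
    """Treat each forecast's log-odds deviation from the base rate as
    independent private evidence and add all the deviations back on."""
    shared = logit(base_rate)
    private = sum(logit(p) - shared for p in forecasts)
    return sigmoid(shared + private)

# Two forecasters both below a 10% base rate: the pooled estimate
# (~1.4%) ends up lower than either individual forecast.
print(pool_vs_base_rate([0.05, 0.03], base_rate=0.10))
```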
Thank you for pointing this out!
I have a sense that log-odds are an underappreciated tool, and this makes me excited to experiment with them more; the "shared and distinct bits of evidence" framework also seems very natural.
On the other hand, if the Goddess of Bayesian evidence likes log odds so much, why did she make expected utility linear in probability? (I am genuinely confused about this)