Average probabilities, not log odds
Let’s say you want to assign a probability to some proposition X. Maybe you think about what odds you’d accept bets at, and decide you’d bet on X at 1:99 odds against X, and you’d bet against X at 1:9 odds against X. This implies you think the probability of X is somewhere between 1% and 10%. If you wouldn’t accept bets in either direction at intermediate odds, how should you refine this interval to a point estimate for the probability of X? Or maybe you asked two experts, and one of them told you that X has a 10% probability of being true, and another told you that X has a 1% probability of being true. If you’re inclined to just trust the experts as you don’t know anything about the subject yourself, and you don’t know which expert to trust, how should you combine these into a point estimate for the probability of X?
One popular answer I’ve seen is to take the geometric mean of the odds (equivalently, to average the log odds). So in either of the above scenarios, the geometric mean of 1:9 and 1:99 is about 1:29.8, so you would assign a probability of about 3.2% to X. I think this is a bad answer, and that a better answer would be to average the probabilities (so, in these cases, you’d average 1% and 10% to get a probability of 5.5% for X). Here are several reasons for this:
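For concreteness, here is a minimal sketch of the two rules applied to the 1% and 10% example (the helper names are mine, not from any library):

```python
import math

def mean_probability(ps):
    # Arithmetic mean of probabilities.
    return sum(ps) / len(ps)

def mean_log_odds(ps):
    # Geometric mean of odds, i.e. arithmetic mean of log odds.
    avg = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-avg))

print(mean_probability([0.01, 0.10]))  # 0.055  -> 5.5%
print(mean_log_odds([0.01, 0.10]))     # ~0.032 -> about 3.2%
```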
Probabilities must add to 1. The average log odds rule doesn’t do this. Let’s try an example. Let’s suppose you’ve got some event A, and you ask three experts what the probability of A is. Expert 1 tells you that A has probability 50%, while experts 2 and 3 both say that A has probability 25%. The geometric mean of 1:1, 1:3, and 1:3, is about 1:2.1, so we get an overall probability of 32.5%, just less than 1⁄3. But now consider two more events, B and C, such that exactly one of A, B, and C must be true. It turns out that expert 1 gives you a probability distribution 50% A, 25% B, 25% C, expert 2 gives you a probability distribution 25% A, 50% B, 25% C, and expert 3 gives you a probability distribution 25% A, 25% B, 50% C. The average log odds rule assigns a 32.5% probability to each of A, B, and C, even though you know one of them must occur. Or, put differently, the average log odds rule assigns probability 32.5% to A, 32.5% to B, and 67.5% to “A or B”, violating additivity of probabilities of disjoint events. Averaging probabilities assigns probability 1⁄3 to each of A, B, and C, as any rule for combining probability estimates which treats the experts interchangeably and treats the events interchangeably must.
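A quick numerical check of this example (my own sketch; the pooling function is the same geometric-mean-of-odds rule as above):

```python
import math

def pool_log_odds(ps):
    # Geometric-mean-of-odds pooling (equivalently, mean of log odds).
    avg = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-avg))

experts = [
    {"A": 0.50, "B": 0.25, "C": 0.25},
    {"A": 0.25, "B": 0.50, "C": 0.25},
    {"A": 0.25, "B": 0.25, "C": 0.50},
]

pooled = {e: pool_log_odds([x[e] for x in experts]) for e in "ABC"}
print(pooled)                # ~0.325 for each of A, B, C
print(sum(pooled.values()))  # ~0.974, not 1

# "A or B" according to each expert, pooled the same way:
print(pool_log_odds([x["A"] + x["B"] for x in experts]))  # ~0.675 != 0.325 + 0.325
```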
There’s a clear model for why averaging probabilities is a reasonable thing to do under some assumptions: Let’s say you have various models that you can use for assigning probabilities to things, and you believe that one of these models is roughly correct, but you don’t know which one. Maybe you’re asking some experts for probabilities, and one of them gives you well-calibrated probabilities that take into account all available information, and the others’ probabilities don’t provide any further useful information, but you have no idea which expert is the good one, even after seeing the probabilities they give you. The appropriate thing to do here is average together the probabilities outputted by your models or experts. In contrast, there are no conditions under which average log odds is the correct thing to do, because violating additivity of probabilities of disjoint events is never the correct thing to do (see the previous paragraph).
I will acknowledge that there are conditions under which averaging probabilities is also not a reasonable thing to do. For example, suppose some proposition X has prior probability 50%, and two experts collect independent evidence about X, and both of them update to assigning 25% probability to X. Since both of them acquired 3:1 evidence against X, and these sources of evidence are independent, combined this gives 9:1 evidence against X, and you should update to assigning 10% probability to X. The prior is important here; if the prior were 10% and both experts updated to 25%, then you should update to 50%. Of course, if you were tempted to use average log odds to combine probability estimates, you’re probably not in the kind of situation in which this makes sense, and combining two probability estimates into something that isn’t between the two probably isn’t intended behavior. If you think carefully about where the probabilities you want to combine together in some situation are coming from, and how they relate to each other, then you might be able to do something better than averaging the probabilities. But I maintain that, if you want a quick and dirty heuristic, averaging probabilities is a better quick and dirty heuristic than anything as senseless as averaging log odds.
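Under the independence assumption in the example above, the aggregation that does make sense is to multiply the experts’ likelihood ratios relative to the shared prior. A small sketch of that calculation (my own illustration, not a method proposed elsewhere in the post):

```python
def odds(p):
    return p / (1 - p)

def pool_independent(prior, expert_probs):
    """Combine experts whose evidence is independent given a shared prior.

    Each expert's posterior odds divided by the prior odds is treated as an
    independent likelihood ratio; the ratios multiply onto the prior odds.
    """
    posterior_odds = odds(prior)
    for p in expert_probs:
        posterior_odds *= odds(p) / odds(prior)
    return posterior_odds / (1 + posterior_odds)

print(pool_independent(0.5, [0.25, 0.25]))  # 0.1  (i.e. 9:1 against)
print(pool_independent(0.1, [0.25, 0.25]))  # 0.5  (even odds)
```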
Probably part of the intuition motivating something more like average log odds rather than average probabilities is that averaging probabilities seems to ignore extreme probabilities. If you average 10% and 0.0000000001%, you get 5%, same as if you average 10% and 0.1%. But 0.1% and 0.0000000001% are really different, so maybe they shouldn’t have almost exactly the same effect on the end result? If the source of that 0.0000000001% figure is considered trustworthy, then they wouldn’t have assigned such an extreme probability without a good reason, and 5% is an enormous update away from that. But first of all, it’s not necessarily true that the more extreme probabilities must have come from stronger evidence if the probabilities are being arrived at rationally; that depends on the prior. For example, suppose two experts are asked to provide a probability distribution over bitstrings of length 20 that will be generated from the next 20 flips of a certain coin. Expert 1 assigns probability 2^-20 to each bitstring. Expert 2 assigns probability 10% to the particular bitstring 00100010101111010000, and distributes probability evenly among the remaining bitstrings. In this case it is expert 2 who’s claiming to have some very interesting information about how this coin works, which they wouldn’t have claimed without good reason, even though they are assigning 10% probability to an event that expert 1 is assigning probability 2^-20 to. Second, what if the probabilities aren’t arrived at rationally? Probabilities are between 0 and 1, while log odds are between −∞ and +∞, so when averaging a large number of probabilities together, no unreliable source can move the average too much, but when averaging a large number of log odds, an unreliable source can have arbitrarily large effects on the result. And third, probabilities, not log odds, are the correct scale to use for decision-making. If expert 1 says some event has probability 1%, or 3%, and expert 2 says the same event has probability 0.01% or 0.0000001%, then, if the event in question is important enough for you to care about these differences, the possibility that expert 1 has accounted for the reasons someone might give a very low probability and has good reason to give a much higher probability instead should be much more interesting to you than the hypothesis that expert 2 has good reason to give such a low probability, and the relatively large differences between the “1%” or “3%” that expert 1 might have told you shouldn’t be largely ignored and washed out in log odds averaging with expert 2.
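To illustrate the second point above about unreliable sources: one wild outlier among ten forecasts moves the probability average by at most a tenth of the [0,1] range, but it can drag the log-odds average arbitrarily far. A sketch with made-up numbers:

```python
import math

def mean_prob(ps):
    return sum(ps) / len(ps)

def mean_log_odds(ps):
    avg = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-avg))

forecasts = [0.10] * 9          # nine sources say 10%
for outlier in (1e-3, 1e-10, 1e-30):
    ps = forecasts + [outlier]
    print(outlier, round(mean_prob(ps), 4), round(mean_log_odds(ps), 6))
# mean_prob stays near 0.09 in every case, while mean_log_odds keeps
# shrinking as the single outlier becomes more extreme.
```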
Let’s go back to the example where you’re trying to get a probability out of the odds you’d be willing to bet at. I think it helps to think about why there would be a significant gap between the worst odds you’d accept a bet at and the worst odds you’d accept the opposite bet at. One reason is that someone else’s willingness to bet on something is evidence for it being true, so there should be some interval of odds in which their willingness to make the bet implies that you shouldn’t, in each direction. Even if you don’t think the other person has any relevant knowledge that you don’t, they can still be more likely to accept bets that are more favorable to them, so if the process by which you turn an intuitive sense of probability into a number is noisy, then if you’re forced to set odds that you’d have to take bets on either side of, even someone who knows nothing about the subject could exploit you on average. I think the possibility that adversaries can make ambiguity resolve against you disproportionately often is a good explanation for ambiguity aversion in general, since there are many situations, not just bets, where someone might have an opportunity to profit from your loss. Anyway, if the worst odds you’d be willing to bet on are bounds on how seriously you take the hypothesis that someone else knows something that should make you update a particular amount, and you want to get an actual probability, then you should average over the probabilities you might end up at, weighted by how likely it is that you should end up at them. This is an arithmetic mean of probabilities, not a geometric mean of odds.
I think it would perhaps be helpful to link to a few people advocating averaging log-odds rather than averaging probabilities, e.g.:
- When pooling forecasts, use the geometric mean of the odds
- My current best guess on how to aggregate forecasts
Personally, I see this as an empirical question: which method works best?
In the cases I care about, both averaging log odds and taking a median far outperform taking a mean. (Fwiw Metaculus agrees that it’s a very safe bet too)
Your example about additivity of disjoint events is somewhat contrived. Averaging log-odds does respect that the probabilities of a single event and its complement sum to 1, but I agree that if you add some additional structure (more than two disjoint outcomes) it might not make sense.
Averaging log-odds is exactly a Bayesian update, so presumably you’d accept there are some conditions under which average log odds is the correct thing to do...
Thanks for the links!
Contrived how? What additional structure do you imagine I added? In what sense do you claim that averaging log odds preserves additivity of probability for disjoint events in the face of an example showing that the straightforward interpretation of this claim is false?
It isn’t; you can tell because additivity of probability for disjoint events continues to hold after Bayesian updates. [Edit: Perhaps a better explanation for why it isn’t a Bayesian update is that it isn’t even the same type signature as a Bayesian update. A Bayesian update takes a probability distribution and some evidence, and returns a probability distribution. Averaging log-odds takes some finite set of probabilities, and returns a probability]. I’m curious what led you to believe this, though.
I did a Monte Carlo simulation for this on my own; you can find the Python script on Pastebin.
Consider the following model: there is a bounded martingale M taking values in [0,1] and with initial value 1/2. The exact process I considered was a Brownian motion-like model for the log odds combined with some bias coming from Ito’s lemma to make the sigmoid transformed process into a martingale. This process goes on until some time T and then the event is resolved according to the probability implied by M(T). You have n “experts” who all get to observe this martingale at some idiosyncratic random time sampled uniformly from [0,T], but the times themselves are unknown to them (and to you).
In this case if you knew the expert who had the most information, i.e. who had sampled the martingale at the latest time, you’d do best to just copy his forecast exactly. You don’t know this in this setup, but in general you should believe on average that more extreme predictions came at later times, and so you should somehow give them more weight. Because of this, averaging the log odds in this setup does better than averaging the probabilities across a wide range of parameter settings. Because in this setup the information sets of different experts are as far as possible from being independent, there would also be no sense in extremizing the forecasts in any way.
In practice, as confirmed by the simulation, averaging log odds seems to do better than averaging the forecasts directly, and the gap in performance gets wider as the volatility of the process M increases. This is the result I expected without doing any Monte Carlo to begin with, but it does hold up empirically, so there’s at least one case in which averaging the log odds is a better thing to do than averaging the probabilities. Obviously you can always come up with toy examples to make any aggregation method look good, but I think modelling different experts as taking the conditional expectations of a martingale under different sigma algebras in the same filtration is the most obvious model.
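The commenter’s actual script is on Pastebin; what follows is my own minimal re-implementation of the setup as described (path simulated by Euler-Maruyama, parameter values arbitrary), which compares the two pooling rules by average log score:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(n_experts=5, sigma=2.0, T=1.0, steps=500):
    """One run of the model: experts observe the martingale M at random times."""
    dt = T / steps
    y = 0.0                                  # log odds, so M = sigmoid(0) = 1/2 initially
    path = np.empty(steps + 1)
    path[0] = 0.5
    for i in range(steps):
        # Drift chosen so that M = sigmoid(Y) is a martingale (Ito correction).
        y += 0.5 * sigma**2 * np.tanh(y / 2) * dt + sigma * np.sqrt(dt) * rng.normal()
        path[i + 1] = 1 / (1 + np.exp(-y))
    obs_idx = rng.integers(0, steps + 1, size=n_experts)  # idiosyncratic observation times
    forecasts = path[obs_idx]
    outcome = rng.random() < path[-1]        # event resolves according to M(T)
    return forecasts, outcome

def log_score(p, outcome):
    return np.log(p if outcome else 1 - p)

def pool_mean(ps):
    return np.mean(ps)

def pool_log_odds(ps):
    z = np.mean(np.log(ps / (1 - ps)))
    return 1 / (1 + np.exp(-z))

scores = {"mean prob": [], "mean log odds": []}
for _ in range(2000):
    forecasts, outcome = simulate_trial()
    scores["mean prob"].append(log_score(pool_mean(forecasts), outcome))
    scores["mean log odds"].append(log_score(pool_log_odds(forecasts), outcome))

for name, s in scores.items():
    print(name, np.mean(s))
```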
Nope! If n=1, then you do know which expert has the most information, and you don’t do best by copying his forecast, because the experts in your model are overconfident. See my reply to ADifferentAnonymous.
But well-done constructing a model in which average log odds outperforms average probabilities for compelling reasons.
The experts in my model are designed to be perfectly calibrated. What do you mean by “they are overconfident”?
The probability of the event is the expected value of the probability implied by M(T). The experts report M(X) for a random variable X sampled uniformly in [0,T]. M(T) differs from M(X) by a Gaussian of mean 0, and hence, knowing M(X), the expected value of M(T) is just M(X). But we want the expected value of the probability implied by M(T), which is different from the probability implied by the expected value of M(T), because expected value does not commute with nonlinear functions. So an expert reporting the probability implied by M(X) is not well-calibrated, even though an expert reporting M(X) is giving an unbiased estimate of M(T).
I don’t know what you’re talking about here. You don’t need any nonlinear functions to recover the probability. The probability implied by M(T) is just M(T), and the probability you should forecast having seen M(X) is therefore
$P(E \mid M(X)) = E[1_E \mid \mathcal{F}_X] = E[E[1_E \mid \mathcal{F}_T] \mid \mathcal{F}_X] = E[M(T) \mid \mathcal{F}_X] = M(X)$, since M is a martingale.
I think you don’t really understand what my example is doing. M is not a Brownian motion and its increments are not Gaussian; it’s a nonlinear transform of a drift-diffusion process by a sigmoid which takes values in [0,1]. M itself is already a martingale so you don’t need to apply any nonlinear transformation to M on top of that in order to recover any probabilities.
The explicit definition is that you take an underlying drift-diffusion process Y following
$dY = \frac{\sigma^2}{2}\left(\frac{e^Y - 1}{e^Y + 1}\right)dt + \sigma\,dz$ and let $M = 1 - \frac{1}{e^Y + 1}$. You can check that this M is a martingale by using Ito’s lemma.
If you’re still not convinced, you can actually use my Python script in the original comment to obtain calibration data for the experts using Monte Carlo simulations. If you do that, you’ll notice that they are well calibrated and not overconfident.
Oh, you’re right, sorry; I’d misinterpreted you as saying that M represented the log odds. What you actually did was far more sensible than that.
That’s alright, it’s partly on me for not being clear enough in my original comment.
I think information aggregation from different experts is in general a nontrivial and context-dependent problem. If you’re trying to actually add up different forecasts to obtain some composite result it’s probably better to average probabilities; but aside from my toy model in the original comment, “field data” from Metaculus also backs up the idea that on single binary questions, median forecasts or log-odds averages consistently beat probability averages.
I agree with SimonM that the question of which aggregation method is best has to be answered empirically in specific contexts and theoretical arguments or models (including mine) are at best weakly informative about that.
My heuristic for deciding what heuristic to use, when you’re going to do something quick-n-dirty: figure out what quantity you’re actually interested in, transform your inputs to its natural scale, take means there for point estimates, and transform back.
How does this apply to some examples?
In your post, you’re talking quite a lot about bets. To a first approximation marginal utility is linear in marginal wealth, so usually this means the quantity we are actually interested in is linear in probability, and the correct heuristic is “arithmetic mean” (of probability).
In SimonM’s comment, we’re talking about probabilities directly. Forecasting. Usually that means what we care about is calibration or a proper scoring rule, so the natural scale is [0,1] or log-odds. Now the correct heuristic is “arithmetic mean” (of log-odds of probability).
What about your difficult examples?
We’re already doing quick-n-dirty things rather than anything rigorous. What I usually do when I want constraints and my summary statistics don’t satisfy them is to just go ahead and normalize, after which of course we get 1⁄3, 1⁄3, 1⁄3 by symmetry after any estimation.
Once we’re no longer just doing something quick-n-dirty, all, uh, bets, are off. But my heuristic is still to transform to the domain where what you care about is linear, do your somewhat-more-sophisticated dirty work, and transform back to get your point estimate.
Not sure what you mean by this. A proper scoring rule incentivizes the same results as deciding what odds you’d be indifferent to betting at (against a gambler whose decisions carry no information about reality).
Counter-example:
9 out of 10 people give a 1:100,000,000 probability estimate of winning the lottery by picking random numbers, the last person gives a 1:10 estimate. Averaging the probabilities gives a 1:100 estimate, and you foolishly conclude these are great odds given how cheap lottery tickets are.
Yes, context matters. If you have background knowledge that the true probability is fairly well known but a few people are completely wrong then you should certainly not just average probabilities. Something like a trimmed median would be far better in that case.
On the other hand, some other questions may be of the sort where experts give much higher odds than most people. Maybe something like “what is the probability that within 12 months you are infected with a virus that includes the following base sequence”, where the 9 people look at the length of the given base sequence, estimate average virus genome size, and give odds on the order of 1:100,000,000. The tenth looked up a viral genome database and found that it’s in all known variants of SARS-CoV-2, and estimated 1:10 odds.
If you don’t know anything about the context, then you can’t distinguish these scenarios just based on the numbers in them. You can’t even reasonably say that there’s some underlying distribution of types of contexts and you can do some sort of average over them.
In your ABC example we rely on the background information that P(A&B) = 0, P(A&C) = 0, P(B&C) = 0, and P(A or B or C) = 1.
So the background information is that the events are mutually exclusive and exhaustive. But only then do probabilities need to add to one. It’s not a general fact that “probabilities add to 1”. So taking the geometric average does itself not violate any axioms of probability. We “just” need to update the three geometric averages on this background knowledge. Plausibly how this should be done in this case is to normalize them such that they add to one. (In the case of the arithmetic mean, updating on the background information plausibly wouldn’t change anything here, but that’s not the case for other possible background information.)
Of course this leads to the question: how should we perform such updates in general, i.e. for arbitrary background assumptions? I think this is commonly done by finding the distribution which maximizes entropy subject to the background information, or by finding the distribution which minimizes the KL-divergence to the old distribution subject to it (I think these methods are equivalent). Of course this requires that we have a full distribution in the first place, rather than just three probabilities obtained by some averaging method.
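For the simple sum-to-one constraint, minimizing the KL-divergence to the unnormalized pooled values does come out to plain rescaling; here is a quick numerical check (my own sketch, with made-up pooled values that fail to sum to 1):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical pooled values for three mutually exclusive, exhaustive events.
p = np.array([0.50, 0.30, 0.10])

def kl_to_p(q):
    # KL divergence from the candidate distribution q to the pooled values p.
    return float(np.sum(q * np.log(q / p)))

result = minimize(
    kl_to_p,
    x0=np.full(3, 1 / 3),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 3,
    constraints=[{"type": "eq", "fun": lambda q: np.sum(q) - 1.0}],
)

print(result.x)      # approximately p / p.sum()
print(p / p.sum())   # plain normalization gives the same distribution
```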
But there is anyway a more general question than whether the geometric mean of the odds or the arithmetic mean of the probabilities is better: how should we “average” two or more probability distributions (rather than just individual probabilities), assuming they come from equally reliable sources?
But going back to your example about extreme probabilities: Of course when we assume that sources have different reliability, then they would need to be weighted somehow differently. But the interesting case here is what the correct averaging method is for the simple case of equally reliable sources. In this simple case the geometric mean seems to make much more sense, since it doesn’t generally discount extreme probabilities. (Extreme probabilities can often be even more reliable than non-extreme ones. E.g. the probability that the Earth is hollow seems extremely low.)
Also, a quick comment here:
If we assume that the prior was indeed important here then this makes sense, but if we assume that the prior was irrelevant (that they would have arrived at 25% even if their prior was e.g. 10% rather than 50%), then this doesn’t make sense. (Maybe they first assumed the probability of drawing a black ball from an urn was 50%, then they each independently created a large sample, and ~25% of the balls came out black. In this case the prior was mostly irrelevant.) We would need a more general description under which circumstances the prior is indeed important in your sense and justifies the multiplicative evidence aggregation you proposed.
Lastly:
This is a very interesting possible explanation for betting aversion without needing to assume, as usual, risk aversion! Or rather two explanations, one with an adversary with additional information and one without. But in the second case I don’t see how a noisy process for a probability estimate would lead to being “forced to set odds that you’d have to take bets on either side of, even someone who knows nothing about the subject could exploit you on average”. Though the first case seems really plausible. It is so simple I would assume you are not the first with this idea, but I have never heard of it before.
My problem with a forecast aggregation method that relies on renormalizing to meet some coherence constraints is that then the probabilities you get depend on what other questions get asked. It doesn’t make sense for a forecast aggregation method to give probability 32.5% to A if the experts are only asked about A, but have that probability predictably increase if the experts are also asked about B and C. (Before you try thinking of a reason that the experts’ disagreement about B and C is somehow evidence for A, note that no matter what each of the experts believe, if your forecasting method is mean log odds, but renormalized to make probabilities sum to 1 when you ask about all 3 outcomes, then the aggregated probability assigned to A can only go up when you also ask about B and C, never down. So any such defense would violate conservation of expected evidence.)
Any linear constraints (which are the things you get from knowing that certain Boolean combinations of questions are contradictions or tautologies) that are satisfied by each predictor will also be satisfied by their arithmetic mean.
That’s part of my point. Arithmetic mean of probabilities gives you a way of averaging probability distributions, as well as individual probabilities. Geometric mean of log odds does not.
In this example, the sources of evidence they’re using are not independent; they can expect ahead of time that each of them will observe the same relative frequency of black balls from the urn, even while not knowing in advance what that relative frequency will be. The circumstances under which the multiplicative evidence aggregation method is appropriate are exactly the circumstances in which the evidence actually is independent.
They make their bet direction and size functions of the odds you offer them in such a way that they bet more when you offer better odds. If you give the correct odds, then the bet ends up resolving neutrally on average, but if you give incorrect odds, then which direction you are off in correlates with how big a bet they make in such a way that you lose on average either way.
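A sketch of one such strategy (my own illustration of the mechanism, with made-up numbers): the adversary always bets against X with a stake proportional to the probability you quote, so they bet more when the offered odds are better for them. If your quotes are unbiased but noisy around the true probability, their expected profit is roughly the variance of your noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def adversary_profit(p_true, noise_sd, n=1_000_000):
    """Average profit of an adversary who sees only your quoted probability."""
    q = np.clip(p_true + rng.normal(0, noise_sd, size=n), 0.01, 0.99)
    x_happens = rng.random(n) < p_true
    # Buying q units of a contract that pays 1 if not-X, at price (1 - q) per unit:
    profit = q * ((~x_happens).astype(float) - (1 - q))
    return profit.mean()

for sd in (0.0, 0.05, 0.10):
    print(sd, adversary_profit(p_true=0.3, noise_sd=sd))
# With no noise the bets are neutral on average; with noisy quotes the
# adversary's average profit comes out close to the variance of the noise.
```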
Taking that as a challenge, can we reverse-engineer a situation where this would be the correct thing to do?
We can first sidestep the additivity-of-disjoint-events problem by limiting the discussion to a single binary outcome.
Then we can fulfill the condition almost trivially by saying our input probabilities are produced by the procedure ‘take the true log odds, add gaussian noise, convert to probability’.
Is that plausible? Well, a Bayesian update is an additive shift to the log odds. So if your forecasters each independently make a bunch of random updates (and would otherwise be accurate), that would do it. A simple model is that the forecasters all have the same prior and a good sample of the real evidence, which would make them update to the correct posterior, except that each one also accepts N bits of fake evidence, each of which has a 50⁄50 chance of supporting X or ~X (and the fake evidence is independent between forecasters).
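A sketch of the procedure just described (true log odds plus independent Gaussian noise per forecaster; my own code, with arbitrary parameter values), comparing the two pooling rules as the number of forecasters grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

true_p = 0.03
true_log_odds = np.log(true_p / (1 - true_p))

for n_forecasters in (3, 10, 100, 10_000):
    noisy_log_odds = true_log_odds + rng.normal(0, 2.0, size=n_forecasters)
    forecasts = sigmoid(noisy_log_odds)
    mean_prob = forecasts.mean()
    mean_lo = sigmoid(noisy_log_odds.mean())
    print(n_forecasters, round(mean_prob, 4), round(mean_lo, 4))
# As n grows, the mean of the log odds recovers the true 3%, while the mean of
# the probabilities settles around E[sigmoid(true_log_odds + noise)], which is
# biased upward here because sigmoid is mostly convex over the sampled range.
```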
That’s not a good enough toy model to convince me to use average log odds for everything, but it is good enough that I’d accept it if average log odds seemed to work in a particular domain.
That doesn’t work, even in the case where the number of probability estimates you’re trying to aggregate together is one. The geometric mean of a set of one number is just that number, so the claim that average log odds is the appropriate way to handle this situation implies that if you are given one probability estimate from this procedure, the appropriate thing to do is take it literally, but this is not the case. Instead, you should try to adjust out the expected effect of the gaussian noise. The correct way to do this depends on your prior, but for simplicity and to avoid privileging any particular prior, let’s try using the improper prior such that seeing the probability estimate gives you no information on what the gaussian noise term was. Then your posterior distribution over the “true log odds” is the observed log odds estimate plus a gaussian. The expected value of the true log odds is, of course, the observed log odds estimate, but the expected value of the true probability is not the observed probability estimate; taking the expected value does not commute with applying nonlinear functions like converting between log odds and probabilities.
Oof, rookie mistake. I retract the claim that averaging log odds is ‘the correct thing to do’ in this case
Still—unless I’m wrong again—the average log odds would converge to the correct result in the limit of many forecasters, and the average probabilities wouldn’t? Making the post title bad advice in such a case?
(Though median forecast would do just fine)
The whole debate seems to be poorly founded. There are many ways to combine probability estimates that have different properties and are suited to different situations. There is no one way that works best for every purpose and context.
In some contexts, the best estimate won’t even be within the range of the individual estimates. For example, suppose there is some binary question with prior 1:1 odds. Three other people have independent evidence regarding the question and give odds of 1:2, 1:3, and 1:4 based on their own evidence. What is your best estimate for the odds? (Hint: it’s not anywhere between 1:2 and 1:4)
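Working the hint under the stated assumptions (a 1:1 prior and independent evidence): each person’s posterior odds divided by the prior odds acts as an independent likelihood ratio, and these multiply, so

$$\text{posterior odds} = 1 \times \frac{1/2}{1} \times \frac{1/3}{1} \times \frac{1/4}{1} = \frac{1}{24},$$

i.e. odds of 1:24, a probability of 1/25 = 4%, more extreme than any of the three individual estimates.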
I agree, and in fact I already gave almost the same example in the original post. My claim was not that averaging probabilities is always appropriate, just that it is often reasonable, and average log odds never is.