My question concerns the semantics of making future predictions such as ‘the probability of X winning the election is 70%’. There are programs, e.g., Good Judgment and IARPA ACE, aimed at excelling at this kind of prediction.
In the classic mathematical interpretation of probability we can say ‘the probability of getting 8 heads in a row when flipping an unbiased coin is 1⁄256’. This statement can be derived mathematically from the assumption that the coin is unbiased, and it can be verified empirically by repeating the experiment and counting the successes. In Bayesian statistics things get slightly fuzzier, but we can still reason as follows. We model our knowledge as ‘the value of the radius of Saturn in km is a random variable with a normal distribution with mean 60,000 and variance 0.1’. Then we make several independent measurements, possibly burdened with some inaccuracy, and refine our prior distribution to one with mean 59,300 and variance 0.01. This does not make sense in the previous interpretation, but we can still attach clear semantics to the sentence above by treating this random variable as the result of a measurement, which is a repeatable random event. If one disagrees with such a statement, they must question either the choice of the prior distribution or the mathematical derivation.
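For concreteness, the refinement step described here can be sketched as a conjugate normal update (a minimal sketch; the function name and the example measurement numbers are illustrative assumptions of mine, not taken from the question):

```python
def update_normal(prior_mean, prior_var, measurements, meas_var):
    """Posterior of a normal prior after i.i.d. normal measurements
    with known measurement variance (standard conjugate update)."""
    post_prec = 1 / prior_var + len(measurements) / meas_var
    post_var = 1 / post_prec
    post_mean = post_var * (prior_mean / prior_var + sum(measurements) / meas_var)
    return post_mean, post_var

# Prior belief about Saturn's radius in km, then four noisy measurements:
mean, var = update_normal(60000.0, 0.1, [59999.5] * 4, 0.2)
# The posterior mean lies between the prior mean and the sample mean,
# and the posterior variance is smaller than the prior variance.
```

The point of the sketch is that every term has a frequency-style reading: the measurement is a repeatable random event, so the posterior can in principle be checked against repeated experiments.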
Now suppose that two forecasters were asked in 2018 the questions below.
a) Will Donald Trump win the 2020 election?
b) Will USD/EUR exchange rate drop below 0.8 in 2020?
c) Will Sumatran orangutan become extinct by 2020?
d) Will humans land on Mars by 2040?
The first forecaster provided the following probabilities for these events: 43%, 90%, 45%, 42%. The other gave: 54%, 50%, 52%, 43%. We already know that events (a), (b), and (c) did not occur. First, how can we settle who has been the better forecaster so far? Secondly, their forecasts for event (d) differ slightly. What kind of argument can the first forecaster make to convince the other that 42% is the ‘correct’ answer? And what does this numerical value actually mean, given that landing on Mars is neither a repeatable random event nor a quantity we can try to measure like the radius of Saturn? If one believes that 42% is a better estimate than 43%, how can that help in making any choices in the future?
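One common way to approach the first question (not mentioned in the question itself) is a proper scoring rule such as the Brier score: the mean squared error between the stated probabilities and the 0/1 outcomes, applied to the three resolved events. A minimal sketch:

```python
def brier(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [0, 0, 0]            # events (a), (b), (c) did not occur
first = brier([0.43, 0.90, 0.45], outcomes)
second = brier([0.54, 0.50, 0.52], outcomes)
# Interestingly, the second forecaster scores better here (lower is better),
# because the first one's confident 90% on the exchange-rate question is
# punished heavily by the squared error.
```

Of course, three resolved events is a tiny sample, so this settles very little by itself; the score only becomes informative over many predictions.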
Probability is in the mind. It’s relative to the information you have.
In practical terms, you typically don’t have good enough resolution to get individual percentage point precision, unless it’s in a quantitative field with well understood processes.
The first forecaster assigned probabilities below 50% to two of the three events that didn’t occur; the second forecaster assigned probabilities above 50% to two of them. So I think that the first forecaster has got a pretty easy case on this one.
I think the rest of your questions assume that the percentages are measuring something in the real world. They are a measure of the predictor’s confidence: a way to tell the world how seriously they think you should take their prediction.
I don’t think he can. He is technically a little less sure than the second forecaster that humans will land on Mars (or, if you prefer, a little more sure that they won’t). And a 1% difference is functionally zero difference in this situation.
If they had vastly different levels of confidence, they could discuss the gaps in their optimism/pessimism, but at a 1% difference... that’s just personal preference.
To repeat myself: they are a measure of the predictor’s confidence, a way to tell the world how seriously they think you should take their prediction.
Even if you had predictors with so many predictions that you could actually take a 1% difference seriously, I still don’t know when that 1% would matter much.
I find this question really interesting. I think the core of the issue is the first part:
I think a good approach would be betting related. I believe different reasonable betting schemes are possible, which in some cases will give conflicting answers when ranking forecasters. Here’s one reasonable setup:
Let A = probability the first forecaster, Alice, predicts for some event.
Let B = probability the second forecaster, Bob, assigns (suppose B > A wlog).
Define what’s called an option: basically a promissory note that pays 1 point if the event happens, and nothing otherwise.
Alice will write and sell N such options to Bob for price P each, with N and P to be determined.
Alice’s EV is positive if P > A (by her estimate she expects to pay out A points per option on average).
Bob’s EV is positive if P < B (by his estimate he expects each option to pay B points on average).
A specific scheme then stipulates how to determine N and P. After that, comparing forecasters across a number of events just translates to comparing points.
As a simple illustration (without claiming it’s great), here’s one possible scheme for P and N:
Alice and Bob split the difference and set P = (A + B)/2.
N = 1.
One drawback of that scheme is that it doesn’t punish a forecaster much for erroneously assigning a probability of 0% or 100% to an event.
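A minimal sketch of the scheme in code (the function name is mine; it returns the realized points for Alice and Bob once the event resolves, using the split-the-difference price):

```python
def settle(A, B, occurred, N=1):
    """Alice sells N options to Bob at price P = (A + B) / 2;
    each option pays Bob 1 point if the event occurs."""
    P = (A + B) / 2
    payout = 1 if occurred else 0
    alice = N * (P - payout)
    bob = N * (payout - P)
    return alice, bob

# Mars-landing example with A = 0.42, B = 0.43, so P = 0.425:
# under Alice's belief her expected score is 0.42 * (0.425 - 1) + 0.58 * 0.425 = +0.005,
# and symmetrically Bob expects +0.005 under his own belief.
```

Note how small the stakes are at a 1% disagreement: each side expects to gain only half the spread per option, which is why such a scheme needs many resolved events before the point totals say anything.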
A different structure of the whole setup would involve not two forecasters betting against each other, but each forecaster betting against some “cosmic bookie”. I have some ideas how to make that work too.
I don’t see how we could assign some canonical meaning to this numerical value. For every forecaster there can always be a better one in principle, who takes into account more information, does more precise calculations, and happens to have better priors (until we reach the level of Laplace’s demon, at which point probabilities might just degenerate into 0 or 1).
If that’s true, then such a numerical value would seem to be just a subjective property specific to a given forecaster: whatever number that forecaster assigns to the event and uses to estimate how many points (or whatever other metric she cares about) she will have in the future.
I like the idea of defining a betting game ‘forecasters vs. cosmic bookie’. Then saying ‘the probability that people will land on Mars by 2040 is 42%’ translates into the semantics ‘I am willing to pay any price Y < 42 cents for an option that is worth $1 if we land on Mars by 2040 and $0 otherwise’.
To compare several forecasters we can consider a game in which each player is offered some options of this kind. Suppose that for each x in {1, …, 99} each player is allowed to buy one option for x cents. If one believes that the probability of an event is 30%, then it is profitable for them to buy the 29 cheapest options and nothing more (it does not matter whether one buys the 30-cent option or not).
To make the calculations simpler, we can make the prices continuous. So one is allowed to buy an option-interval [0, x] for some real x in [0, 1]: by integration its price should be x²/2, and the pay-off is x if the event occurs. If the ‘true’ probability of the event is y, then the expected profit equals yx − x²/2. One can easily see that if you know the value of y, then the optimal strategy sets x = y. The larger the mistake you make, the lower your expected profit. The value of the game is the sum of all the profits, and being a good forecaster means being able to design a strategy with high expected revenue.
An important drawback of this approach is that when you correctly estimate the probability of a successful Mars landing to be 42%, the optimal strategy gives expected profit 0.42²/2. However, if the question were ‘what is the probability that people will FAIL to land on Mars by 2040?’, then the same knowledge gives the answer 58% and a different expected profit: 0.58²/2. Hence, the bookie should also sell options that pay when the event does not occur or, equivalently, always consider each question together with its dual, i.e., the question about the event not happening. Now it begins to look like a proper mathematical formalization of forecasting.
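The optimum and the asymmetry are easy to check numerically (a sketch; the grid search simply confirms that x = y maximizes the expected profit):

```python
def expected_profit(x, y):
    """Buy the option-interval [0, x]: price x**2 / 2, payoff x with probability y."""
    return y * x - x ** 2 / 2

# The optimum sits at x = y:
best = max((x / 1000 for x in range(1001)), key=lambda x: expected_profit(x, 0.42))
# best is 0.42, with expected profit 0.42**2 / 2 = 0.0882,
# while the dual question (y = 0.58) yields 0.58**2 / 2 = 0.1682.
```

So the same state of knowledge earns different expected profits depending on which phrasing of the question the bookie happens to offer, which is exactly why the dual options are needed.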
Still, the problem remains that the choice of available options is arbitrary. Here I assumed that the prices are distributed uniformly on the interval [0, 1], but one can consider some other distribution. The choice of the distribution governs how much you lose when you are off by 1% or 2%. Under a non-uniform distribution, the loss from confusing 50% with 51% also differs from the loss from confusing, e.g., 70% with 71%. Tweaking the parameters of the distribution can change the result of any forecasting competition, but this should be fine as long as the parameters are known to the contestants.
That’s perfect, I was thinking along the same lines, with a range of options available for sale, but didn’t do the math and so didn’t realize the necessity of dual options. And you are right of course, there’s still quite a bit of arbitrariness left. In addition to varying the distribution of options there is, for example, freedom to choose what metric the forecasters are supposed to optimize. It doesn’t have to be EV, in fact in real life it rarely should be EV, because that ignores risk aversion. Instead we could optimize some utility function that becomes flatter for larger gains, for example we could use Kelly betting.
That’s a very interesting question, and it is unfortunate that it did not get more traction, because I think I could learn a lot by reading more answers. What follows is in no way a definitive answer; it is just my own take.
A naive answer would be: pick the forecaster with the lowest cross-entropy, i.e., score them the same way that, when training a binary classifier which outputs probabilities, we pick the model which minimises the cross-entropy. I say this answer is naive because, if we take the question at face value and want to pick the best forecaster, four predictions of disparate events are barely enough information to make a conclusive decision. Relying on the model analogy again, we rarely can judge a model on a single prediction; we need an aggregate of them*.
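For the three resolved events, the cross-entropy (log loss) is easy to compute (a sketch; note that by this score the second forecaster actually comes out ahead, mostly because of the first one’s confident 90% on an event that did not occur):

```python
import math

def cross_entropy(probs, outcomes):
    """Average negative log-likelihood of the observed 0/1 outcomes."""
    return -sum(o * math.log(p) + (1 - o) * math.log(1 - p)
                for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [0, 0, 0]  # events (a), (b), (c) did not occur
first = cross_entropy([0.43, 0.90, 0.45], outcomes)
second = cross_entropy([0.54, 0.50, 0.52], outcomes)
# second < first: the second forecaster has the lower (better) cross-entropy.
```

With only three events this is, as said above, far from conclusive; it merely shows the mechanics of the comparison.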
Each forecaster will have built their model in their own way. In a black-box approach, forecaster A could ask B to use his model to predict a series of similar events for which we know the outcome, e.g. “predict the probability that man will land on the Moon by 1970”, and use those results as an argument to convince B that the model is miscalibrated. The other alternative is to have forecaster A examine B’s model and see which assumptions or parts of the model they disagree on.
One school of thought, as stated by others, relates to bets. If I assign probability p to a future event, I am willing to sell, at the price of p dollars, a promise to pay $1 if the event takes place, and I am also willing to be the one buying the promise (see the Dutch book argument). I don’t find this argument satisfying. While this view makes a lot of sense in prediction markets, I don’t think other practitioners would back the output of every single model with a bet. Also, the utility of money is non-linear, which makes things more complicated.
In my day-to-day, I think of Bayesian stats as a formal theory (a powerful and useful one) which allows me to combine evidence and produces a score, a number quantifying how likely a proposition is to be true. On top of that, Cox’s theorem tells me that if I accept some assumptions, any theory for scoring statements will satisfy the axioms of probability; in other words, the equations of this theory are not arbitrary, they have some grounding. Not everybody agrees on the assumptions made by Cox’s theorem; if we reject some of them, we can end up with something different, e.g. Dempster-Shafer theory.
If both forecasters are willing to back their predictions with bets, there is an arbitrage opportunity for a third party. One can sell a promise to B for $0.43 and then buy a promise from A for $0.42. No matter the outcome, you end up with a positive profit of $0.01.
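The arbitrage is easy to verify (a sketch; positions are written from the third party’s point of view, and both promises pay $1 if and only if the event occurs):

```python
def arbitrage_profit(sell_price, buy_price, occurred):
    """Sell one promise at sell_price, buy one at buy_price."""
    payout = 1 if occurred else 0
    return (sell_price - payout) + (payout - buy_price)

# Whatever happens, the $1 payouts cancel and only the spread remains:
# arbitrage_profit(0.43, 0.42, True) and arbitrage_profit(0.43, 0.42, False)
# are both (up to floating-point noise) 0.01.
```

This is the Dutch book mechanism in miniature: any two agents quoting different probabilities for the same event can be pumped for the spread.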
From a scoring perspective, a 1% difference in this range (close to 50%) doesn’t mean much. If the range considered were closer to the extremes of 0% or 100%, this difference would be more significant. E.g., if one model assigns 0.1% and the other 1.1%, it might be worth examining the difference between the models that leads to this discrepancy.
*An example of a model that can be judged on a single prediction: say that I have a coin and a model predicts the probability of tails to be 10^-12. We flip the coin and tails comes up. We are more inclined to think the model is wrong than to believe that we witnessed an extremely rare event.
This is a topic I have found myself thinking about a lot lately as well. I have found it useful to decompose a non-repeatable event (will X win the election?) into two parts: a combination of repeatable events, plus a “specific residual”.
Let’s start with a coin toss. It is, in a very Heraclitean sense, a one-time event, which we can decompose as a throw of an ideal coin plus a tiny, negligible “specific residual”.
Let’s go back to the problems in question. You could decompose them into combinations of events for which we have historical frequencies (how many times has an incumbent politician...? how many times has an election during an economic crisis...? how do the probabilities of winning an election relate to the polls three months before...?), plus a conceivably larger “specific residual” given the particularities of the question.
This approach is more useful than vague considerations like “probability is in your head” or “it just relates to information”. It is actually how predictors work: decomposing the question into subquestions on which frequency considerations are easier to elicit, recombining them, and adding an extra layer of uncertainty on top.
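A toy sketch of that recipe (the base-rate numbers and the log-odds recombination rule are purely illustrative assumptions of mine, not a standard method):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def recombine(base_rates, residual_weight):
    """Average the sub-question base rates in log-odds space, then shrink
    toward 50% to account for the question's 'specific residual'."""
    avg = sum(logit(p) for p in base_rates) / len(base_rates)
    return sigmoid((1 - residual_weight) * avg)

# Hypothetical base rates for two sub-questions, discounted by a 30% residual:
p = recombine([0.6, 0.7], 0.3)
# p lies between 0.5 and the undiscounted combination recombine([0.6, 0.7], 0.0).
```

The residual weight is where the irreducibly one-off character of the event lives: the larger it is, the closer the final answer is pulled back toward 50%.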