That’s a very interesting question, and it is unfortunate that it did not get more traction, because I think I could learn a lot by reading more answers. What follows is in no way a definitive answer; it is just my own take.
First, how can we settle who has been a better forecaster so far?
A naive answer would be: pick the forecaster with the lowest cross-entropy, in the same way that, when we train a binary classifier which outputs probabilities, we pick the model that minimises the cross-entropy. I say this answer is naive because, if we take the question at face value and want to pick the best forecaster, 4 predictions of disparate events are barely enough information to reach a conclusive decision. Relying on the model analogy again, we can rarely judge a model on a single prediction; we need an aggregate of many*.
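As a minimal sketch of what that comparison would look like (every outcome and probability below is made up purely for illustration), the cross-entropy of each forecaster is just their average negative log-likelihood over the resolved events:

```python
import math

# Hypothetical resolved events: 1 = happened, 0 = did not happen.
outcomes = [1, 0, 0, 1]

# Hypothetical probabilities each forecaster assigned to those events.
forecaster_a = [0.42, 0.10, 0.35, 0.80]
forecaster_b = [0.43, 0.25, 0.30, 0.60]

def cross_entropy(probs, outcomes):
    """Average negative log-likelihood (log loss); lower is better."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, outcomes)
    ) / len(outcomes)

print(f"Forecaster A: {cross_entropy(forecaster_a, outcomes):.3f}")
print(f"Forecaster B: {cross_entropy(forecaster_b, outcomes):.3f}")
```

With only a handful of events, the score that comes out of this is extremely noisy, which is the point being made above.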
What kind of argumentation can the first forecaster make to convince the other one that 42% is the ‘correct’ answer?
Each forecaster will have built their model in their own way. In a black-box approach, forecaster A could ask B to use his model to predict a series of similar events for which we already know the outcome, e.g. “predict the probability that man will land on the moon by 1970”, and use those results as an argument to convince B that his model is miscalibrated. The other alternative is to have forecaster A examine B’s model and see which assumptions or parts of the model they disagree on.
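A rough sketch of that black-box calibration check (the track record for forecaster B below is entirely hypothetical): bucket B’s past predictions and compare each stated probability with the frequency actually observed.

```python
from collections import defaultdict

# Hypothetical track record of forecaster B on events that have since resolved:
# (predicted probability, outcome) pairs.
track_record = [(0.9, 1), (0.8, 1), (0.7, 0), (0.3, 0), (0.2, 1),
                (0.6, 1), (0.4, 0), (0.8, 0), (0.1, 0), (0.5, 1)]

# Bucket predictions and compare the stated probability with the observed frequency.
buckets = defaultdict(list)
for p, y in track_record:
    buckets[round(p, 1)].append(y)

for p in sorted(buckets):
    ys = buckets[p]
    print(f"predicted {p:.1f} -> observed {sum(ys) / len(ys):.2f} over {len(ys)} events")
```

If the observed frequencies sit systematically above or below the stated probabilities, that is the miscalibration argument A would bring to B.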
And what does this numerical value actually mean, as landing on Mars is not a repetitive random event, nor is it a quantity which we can try measuring, like the radius of Saturn?
One school of thought, as stated by others, relates probabilities to bets. If I assign probability p to a future event, I am willing to sell a promise of paying $1 if the event takes place at a price of p dollars, and I am also willing to be the one buying the promise (see the Dutch book argument). I don’t find this argument satisfying. While this view makes a lot of sense in prediction markets, I don’t think other practitioners would be backing the output of every single model with a bet. Also, the utility of money is non-linear, which makes things more complicated.
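To make the betting reading concrete (a toy calculation, not part of the original argument): under my own belief p, selling the $1 promise at exactly p dollars is fair in expectation, and any other price loses me money on one side of the trade.

```python
def expected_profit_of_selling(price, my_probability):
    """Expected profit, under my own belief, of selling a $1 promise at `price`."""
    return price - my_probability * 1.0

print(expected_profit_of_selling(0.42, 0.42))  # 0.00: fair under my belief
print(expected_profit_of_selling(0.40, 0.42))  # -0.02: I expect to lose by underpricing
```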
In my day-to-day work, I think of Bayesian stats as a formal theory (a powerful and useful one) which allows me to combine evidence and produce a score, a number quantifying how likely a proposition is to be true. On top of that, Cox’s theorem tells me that if I accept some assumptions, any theory for scoring statements will satisfy the axioms of probability; in other words, the equations of this theory are not arbitrary, they have some grounding. Not everybody agrees on the assumptions made by Cox’s theorem, and if we reject some of them we can end up with something different, e.g. Dempster-Shafer theory.
If one believes the 42% is a better estimation than 43%, how can it help making any choices in the future?
If both forecasters are willing to back up their predictions with bets, there is an arbitrage opportunity for a 3rd party. One can sell a promise to B for $0.43 and then buy a promise from A for $0.42. No matter what the outcome is, you make a guaranteed profit of $0.01.
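A toy check of that arbitrage (the prices are the ones quoted above; the rest is just bookkeeping of who pays whom in each outcome):

```python
# A quotes 42%, B quotes 43%. A 3rd party sells a $1 promise to B at $0.43
# and buys an identical promise from A at $0.42.
sell_to_b = 0.43   # cash received from B
buy_from_a = 0.42  # cash paid to A

for event_happens in (True, False):
    payout_to_b = 1.0 if event_happens else 0.0    # what we owe B
    payout_from_a = 1.0 if event_happens else 0.0  # what A owes us
    profit = sell_to_b - buy_from_a - payout_to_b + payout_from_a
    print(f"event happens = {event_happens}: profit = {profit:.2f}")
```

The two $1 payouts cancel in both branches, leaving the $0.01 spread.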
From a scoring perspective, a 1% difference in this range (close to 50%) doesn’t mean much. If the range considered were closer to the extremes of 0% or 100%, the same difference would be far more significant. E.g. if one model assigns 0.1% and the other 1.1%, it might be worth examining the differences between the two models that lead to this discrepancy.
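A quick illustration with the log-loss/cross-entropy scoring used above and the numbers from this answer:

```python
import math

def log_loss_if_event_happens(p):
    """Penalty -log(p) incurred when the event does occur."""
    return -math.log(p)

# Near 50%, a one-point gap barely changes the penalty...
print(log_loss_if_event_happens(0.42), log_loss_if_event_happens(0.43))    # ~0.868 vs ~0.844
# ...while the same absolute gap near 0% changes it by a large amount.
print(log_loss_if_event_happens(0.001), log_loss_if_event_happens(0.011))  # ~6.91 vs ~4.51
```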
*An example of a model that can be judged with a single prediction: say that I have a coin and a model predicts a probability of tails of $10^{-12}$. We flip the coin and tails comes up. We are more inclined to think the model is wrong than to believe that we witnessed an extremely rare event.
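One way to read that footnote in Bayesian terms (the prior and the alternative hypothesis below are numbers I am making up purely for illustration): even a tiny prior suspicion that the model is wrong swamps the model’s $10^{-12}$ after a single observation of tails.

```python
# A rough Bayesian reading of the footnote (these numbers are illustrative only).
p_tails_given_model_ok = 1e-12     # the model's stated probability of tails
p_tails_given_model_wrong = 0.5    # e.g. the coin is actually an ordinary fair coin
prior_model_wrong = 1e-6           # even a tiny prior doubt about the model...

posterior_odds = (prior_model_wrong * p_tails_given_model_wrong) / (
    (1 - prior_model_wrong) * p_tails_given_model_ok
)
print(posterior_odds)  # ~5e5: after seeing tails, "the model is wrong" dominates
```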