That reminds me of a question about judging predictions: Is there any established method to say “x made n predictions, was underconfident / calibrated properly / overconfident and the quality of the predictions was z”?
Assuming the predictions are given as “x will happen (y% confidence)”.
It is easy to make 1000 unbiased predictions about lottery drawings, but this does not mean you are good at making predictions.
Yes: use a scoring rule to rate your predictions, giving you an overall evaluation of their quality. If you use, say, the Brier score, that admits decompositions into separate components, for instance “calibration” and “refinement”; if your “refinement” score was high on the lottery drawings, meaning that you’d assigned higher probabilities of winning to the people who did in fact win (as opposed to correctly calling the probabilities of winning overall), you’d be a suspect for game-rigging or psi powers. ;)
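For concreteness, here is a minimal Python sketch of the Brier score and its Murphy-style decomposition into reliability, resolution, and uncertainty (the first two are close kin to the “calibration” and “refinement” components mentioned above); the lottery numbers are purely illustrative:

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def murphy_decomposition(forecasts, outcomes):
    """BS = reliability - resolution + uncertainty (Murphy's decomposition).
    Reliability measures miscalibration (lower is better); resolution measures how far
    the outcome frequencies within each forecast bin sit from the overall base rate."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)  # group outcomes by the probability that was forecast
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# 100 lottery-style forecasts: always "10%", and exactly 10 of them come true.
forecasts = [0.1] * 100
outcomes = [1] * 10 + [0] * 90
print(brier_score(forecasts, outcomes))           # 0.09
print(murphy_decomposition(forecasts, outcomes))  # (0.0, 0.0, 0.09): calibrated, no resolution
```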
Interesting, thanks, but not exactly what I looked for. As an example, take a simplified lottery: 1 number is drawn out of 10. I can predict “number X will be drawn (10% confidence)” 100 times in a row—this is correct, and will give a good score under any scoring rule. However, those predictions are not interesting.
If I make 100 predictions “a meteorite will hit position X tomorrow (10% confidence)” and 10% of them are correct, those predictions are very interesting—you would expect that I have some additional knowledge (for example, that I observed an approaching asteroid).
The difference between the examples is the quality of the predictions: Everybody can get correct (unbiased) 10%-predictions for the lottery, but getting enough evidence to make correct 10%-probabilities for asteroid impacts is hard—most predictions for those positions will be way lower.
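(To put numbers on it: in both cases there are 100 binary predictions at 10%, of which 10 come true, so the plain Brier score is (10 × (1 − 0.1)² + 90 × (0 − 0.1)²) / 100 = 0.09 either way. The score alone does not see that the sensible prior for the lottery is already 10%, while the prior for any given impact location is far lower.)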
Interesting, thanks, but not exactly what I looked for.
Help me understand what you’re describing? Below is a stab at working out the math (I’m horrible at math, I have to laboriously work things out with a bc-like program, but I’m more confident in my grasp of the concepts).
The salient feature of your meteorite predictions is location. We can score these forecasts exactly as GJP scores multiple-choice forecasts, as long as they’re well-specified. Let’s refine “hit position X” to “within 10 miles of X”. That translates to roughly a one in a million chance of calling the location correctly (the surface area of the Earth divided by the area of a 10-mile-radius circle is about 10^6). We can make a similar calculation for the probability that a meteorite hits at all; it comes out to roughly one per day on average, so we can simplify and assume exactly one hits every day.
So a forecast that “a meteorite will hit location X tomorrow at 10% confidence” is equivalent to dividing Earth into one million cells, each cell being one possible outcome in a multiple-outcome forecast, and putting 10% probability mass into one cell. Let’s say you distribute the remaining probability evenly among the 999,999 remaining cells. We can now compute your Brier loss function, the sum of squared errors.
Either the meteorite hits X, and your score is .81 (the squared error for giving 10% to an event that happens), plus a negligible epsilon-squared contribution from each of the other 999,999 cells. Or the meteorite hits a different cell, and your Brier score is about 1.01: roughly 1 for the cell that gets hit despite a forecast probability of nearly zero, plus .01 for the 10% you put on X that didn’t happen, plus negligible epsilon-squared contributions from the remaining cells.
So, over 100 such events, the expected value of your score ranges from 81 if you have laser-like accuracy, to 101 if you’re just guessing at random. Intermediate values reflect intermediate accuracies. The range of scores is fairly narrow, because your probability mass isn’t very concentrated—only a 10% bump on the “jackpot” cell, the rest spread around the surface of the earth.
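A quick Python sketch of that arithmetic, under the same simplifying assumptions (exactly one impact per day, about 10^6 cells, the leftover 90% spread uniformly):

```python
CELLS = 10**6                   # ~10-mile cells tiling the Earth's surface
P_X = 0.10                      # probability mass placed on the named cell X
eps = (1 - P_X) / (CELLS - 1)   # leftover mass spread uniformly over the rest

def brier_loss(hit_is_X):
    """Sum of squared errors over all cells for one day's forecast."""
    if hit_is_X:
        return (1 - P_X) ** 2 + (CELLS - 1) * eps ** 2
    # impact lands in some other cell, which was forecast at probability eps
    return (1 - eps) ** 2 + P_X ** 2 + (CELLS - 2) * eps ** 2

print(brier_loss(True))         # ~0.81
print(brier_loss(False))        # ~1.01

# Over 100 days: always naming the right cell gives ~81; a random guesser
# names the right cell only 1 time in a million, so expects ~101.
p_lucky = 1 / CELLS
print(100 * brier_loss(True))
print(100 * (p_lucky * brier_loss(True) + (1 - p_lucky) * brier_loss(False)))
```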
If any of the above is wrong (math-wise) or stupid, or misrepresents your model, I’d appreciate knowing. :)
To calculate the Brier score, you used >your< assumption that meteorites have a 1 in a million chance of hitting a specific area. What about events without a natural way to get those assumptions?
Let’s use another example:
Assume that I predict that neither Obama nor Romney will be elected, with 95% confidence. If that prediction comes true, it is amazing and indicates high predictive power (especially if I make multiple similar predictions and most of them come true).
Assume that I predict that either Obama or Romney will be elected, with 95% confidence. If that prediction comes true, it is not surprising.
Where is the difference? The second event is expected by others. How can we quantify this “difference from the expectations of others” and include it in the score? Maybe with an additional weight—weight each prediction by how far it departs from the expectations of others (as the mean of a log ratio, or something like that).
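A very rough sketch of that weighting idea (assuming the usual logarithmic score; the 1% consensus figure for the “neither” outcome is made up for illustration):

```python
import math

def log_score(p_assigned_to_outcome):
    """Log score: the log of the probability you gave to what actually happened."""
    return math.log(p_assigned_to_outcome)

# "Neither Obama nor Romney is elected" comes true.
# I had 95% on it; suppose the consensus had only 1% on it (illustrative number).
print(log_score(0.95) - log_score(0.01))   # ~4.55: a huge edge over the crowd

# "Obama or Romney is elected" comes true; both I and the crowd said 95%.
print(log_score(0.95) - log_score(0.95))   # 0.0: no credit for predicting the expected
```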
If the objective is to get better scores than others, then that helps, though it’s not clear to me that it does so in any consistent way (in particular, the strategy that maximizes your expected score and the strategy that gives you the best chance of having the top score may well be different, and one of them might involve mis-reporting your own degree of belief).
How can we quantify “difference from the expectations of others” and include it in the score?
You’re getting this from the “refinement” part of the calibration/refinement decomposition of the Brier score. Over time, your score will end up much better than others’ if you have better refinement (e.g. from “inside information”, or from a superior methodology), even if everyone is identically (perfectly) calibrated.
This is the difference between a weather forecast derived from looking at a climate model (e.g. “I assign 68% probability to the temperature today in your city being within one standard deviation of its average October temperature”) and one derived from looking out the window.
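A toy simulation of that difference, with made-up numbers (a 30% base rate for “rain” and a window signal that matches reality 90% of the time); both forecasters are calibrated, only the refinement differs:

```python
import random

random.seed(0)
BASE_RATE = 0.3     # assumed long-run frequency of "rain" (illustrative)
ACC = 0.9           # how often the window-watcher's signal matches reality
N = 100_000

days = [random.random() < BASE_RATE for _ in range(N)]

# Climatology forecaster: always the base rate. Perfectly calibrated, zero resolution.
climatology = [BASE_RATE] * N

# Window-watcher: sees a noisy signal and reports the Bayesian posterior.
# Also calibrated, but the forecasts actually discriminate between days.
posterior = {
    True:  ACC * BASE_RATE / (ACC * BASE_RATE + (1 - ACC) * (1 - BASE_RATE)),
    False: (1 - ACC) * BASE_RATE / ((1 - ACC) * BASE_RATE + ACC * (1 - BASE_RATE)),
}
window = [posterior[rain if random.random() < ACC else not rain] for rain in days]

def brier(ps, outcomes):
    return sum((p - o) ** 2 for p, o in zip(ps, outcomes)) / len(outcomes)

print(brier(climatology, days))   # ~0.21 = BASE_RATE * (1 - BASE_RATE)
print(brier(window, days))        # ~0.08: same calibration, better refinement, lower loss
```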
ETA: what you say about my using an assumption is not correct—I’ve only been making the forecast well-specified, so that the way you said you allocated your probability mass gives us a proper loss function, and simplifying the calculation by spreading the remaining 90% of your probability mass uniformly. You can compute the loss function for any allocation of probability among outcomes that you care to name; the math just becomes more complicated. I’m not making any assumptions about the probability distribution of the actual events, and neither does the math. It’s quite general.
I can still make 100,000 lottery predictions and get a good score. I am looking for a system that cannot be gamed in that way.
Ok, for each prediction, you can subtract the average score from your score. That should work. Assuming that the other predictors are rational, too, the expected difference on the lottery predictions is 0.
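A crude sketch of that comparison (my numbers: 100 questions, 10% of them resolve yes, and a consensus of roughly one in a million for each impact location, as in the cell calculation above): score everyone on the same questions and look at the average difference in Brier loss.

```python
def brier(p, outcome):
    """Binary Brier loss for a single forecast."""
    return (outcome - p) ** 2

def edge_over_crowd(my_p, crowd_p, outcomes):
    """Average (crowd loss - my loss); positive means I beat the consensus."""
    return sum(brier(c, o) - brier(m, o)
               for m, c, o in zip(my_p, crowd_p, outcomes)) / len(outcomes)

outcomes = [1] * 10 + [0] * 90          # 10 of 100 predictions come true

# Lottery: everyone says 10%, so there is no edge to be had.
print(edge_over_crowd([0.1] * 100, [0.1] * 100, outcomes))    # 0.0

# Meteorite: I say 10%, the consensus says ~1e-6, and 10% of my calls still hit.
print(edge_over_crowd([0.1] * 100, [1e-6] * 100, outcomes))   # ~ +0.01
```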
I’ve only been making the forecast well-specified
I think “impact here (10% confidence), no impact at that place (90% confidence)” is quite specific. It is a binary event.