If the things you’re predicting are completely independent, then naive “calibration” works fine: if you’re good at putting things into an “80% likely” bucket, then in practice ~80% of those predictions will be true.
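Concretely, "naive calibration" here just means grouping past predictions by their stated probability and comparing each bucket's label to its realized frequency. A minimal sketch, with made-up predictions:

```python
from collections import defaultdict

# (stated probability, did the proposition come true?) -- made-up examples
predictions = [(0.8, True), (0.8, True), (0.8, False), (0.8, True),
               (0.9, True), (0.9, True), (0.9, True), (0.9, False)]

# Group by stated probability ("bucket") and compare label vs. realized frequency.
buckets = defaultdict(list)
for p, outcome in predictions:
    buckets[p].append(outcome)

for p, outcomes in sorted(buckets.items()):
    realized = sum(outcomes) / len(outcomes)
    print(f"{p:.0%} bucket: {realized:.0%} came true over {len(outcomes)} predictions")

# With independent propositions and enough of them, a well-calibrated predictor's
# realized frequencies converge to the bucket labels.
```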
If the things you’re predicting are highly correlated with each other—e.g. questions like “Will company X fail?”, “Will company Y fail?”, and so on, when the most likely way for company X to fail involves a general economic downturn that affects all the companies—then even if you were perfect at putting propositions into the 5% bucket, the actual outcomes may look a lot more like “0% became true” or “100% became true” than like “5% became true”.
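To make that concrete, here is a minimal simulation sketch (the 4% downturn probability and the 70% / 2.3% conditional failure rates are the illustrative numbers used below, not real data): every "company i fails" proposition belongs in the ~5% bucket, the predictor is perfectly calibrated marginally, and yet almost every individual year resolves as roughly "2% came true" or roughly "70% came true", not "5% came true".

```python
import random

random.seed(0)

P_DOWNTURN = 0.04        # probability of the shared common cause (illustrative)
P_FAIL_DOWNTURN = 0.70   # each company's failure probability given a downturn
P_FAIL_NORMAL = 0.023    # ...and given no downturn
N_COMPANIES = 100        # all 100 "company i fails" propositions sit in the ~5% bucket

yearly_fractions = []
for _ in range(10_000):
    downturn = random.random() < P_DOWNTURN     # shared cause, drawn once per year
    p_fail = P_FAIL_DOWNTURN if downturn else P_FAIL_NORMAL
    failures = sum(random.random() < p_fail for _ in range(N_COMPANIES))
    yearly_fractions.append(failures / N_COMPANIES)

# Marginal probability per company is 0.04*0.70 + 0.96*0.023 ~= 0.05, and the
# long-run average across years matches it...
print(sum(yearly_fractions) / len(yearly_fractions))      # ~0.05
# ...but any single year looks nothing like "5% came true":
print(sum(f < 0.10 for f in yearly_fractions) / 10_000)   # ~0.96 of years: ~2% came true
print(sum(f > 0.50 for f in yearly_fractions) / 10_000)   # ~0.04 of years: ~70% came true
```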
Therefore, when evaluating someone’s calibration, or creating a set of predictions one plans to evaluate later, one should take these correlations into account.
If one expects correlated outcomes, probably the best thing is to factor out the correlated part into its own prediction—e.g. “Chance of overall downturn [i.e. GDP is below X or something]: 4%” and “Chance of company X failing, conditional on overall downturn: 70%” and “Chance of company X failing, conditional on no downturn: 2.3%” (which comes out to ~5% total).
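The "~5% total" is just the law of total probability applied to those three numbers:

```python
p_downturn = 0.04
p_fail_given_downturn = 0.70
p_fail_given_no_downturn = 0.023

# P(fail) = P(downturn)*P(fail|downturn) + P(no downturn)*P(fail|no downturn)
p_fail = (p_downturn * p_fail_given_downturn
          + (1 - p_downturn) * p_fail_given_no_downturn)
print(p_fail)   # 0.05008, i.e. ~5%
```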
If the predictor didn’t do this, but there was an obvious-in-retrospect common cause affecting many propositions… well, you still don’t know what probability the predictor would have assigned to that common cause, which makes it difficult to judge them fairly. It seems like the most rigorous thing you can do is pick one of the correlated propositions and throw out the rest, so that the resulting set of propositions is (mostly) independent. If this leaves you with too few propositions to do good statistics with, that is unfortunate.
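As a sketch of that pruning step: suppose you can, in retrospect and as a judgment call, tag each proposition with the common cause it most depends on; then "keep one per cluster" is straightforward. The tagging scheme and the example predictions below are hypothetical:

```python
# Each entry: (proposition, stated probability, outcome, retrospective common-cause tag).
# The tags are a judgment call by the evaluator; None means "no obvious common cause".
predictions = [
    ("Company X fails",      0.05, True,  "downturn"),
    ("Company Y fails",      0.10, True,  "downturn"),
    ("Company Z fails",      0.15, True,  "downturn"),
    ("New office opens",     0.80, True,  None),
    ("Paper gets accepted",  0.60, False, None),
]

def mostly_independent_subset(preds):
    """Keep the first proposition from each common-cause cluster plus all untagged
    ones, so the surviving set is (mostly) independent -- at the cost of data."""
    seen_clusters = set()
    kept = []
    for prop, p, outcome, cluster in preds:
        if cluster is not None:
            if cluster in seen_clusters:
                continue
            seen_clusters.add(cluster)
        kept.append((prop, p, outcome))
    return kept

print(mostly_independent_subset(predictions))
# Only one of the three correlated "company fails" propositions survives.
```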
One might think that if you’re evaluating buckets separately (e.g. “the 80% bucket”, “the 90% bucket”), it’s OK for a proposition in one bucket to be correlated with a proposition in another bucket: as long as there’s no correlation within each bucket, it remains the case that, if the predictor was good, ~80% of the propositions in the 80% bucket should be true. But then you can’t do a meta-evaluation at the end that combines the results of the separate buckets. E.g. if the predictor said “5% company X fails, 10% company Y fails, 15% company Z fails, 20% company Q fails”, and there was a downturn and they all failed, then concluding “the predictor tended to be underconfident” would be illegitimate.
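To see why that conclusion would be illegitimate, compare how likely “all four fail” is under independence versus under a shared downturn. The conditional probabilities below are made up (only the 5/10/15/20% marginals come from the example), but the point is generic: a common cause can make the all-failed outcome orders of magnitude more likely than the naive product of the stated probabilities, so observing it tells you little about calibration.

```python
from math import prod

# Illustrative conditionals, chosen so the marginals come out to 5%, 10%, 15%, 20%.
p_downturn = 0.04
p_fail_given_downturn = [0.70, 0.80, 0.90, 0.95]   # assumed, for illustration
marginals = [0.05, 0.10, 0.15, 0.20]                # the predictor's stated probabilities
p_fail_given_normal = [(m - p_downturn * c) / (1 - p_downturn)
                       for m, c in zip(marginals, p_fail_given_downturn)]

# If the four failures were independent, "all four fail" would be vanishingly rare:
p_all_independent = prod(marginals)                 # 0.05*0.10*0.15*0.20 = 1.5e-4
# With the shared downturn, it is two orders of magnitude more likely:
p_all_correlated = (p_downturn * prod(p_fail_given_downturn)
                    + (1 - p_downturn) * prod(p_fail_given_normal))
print(p_all_independent, p_all_correlated)          # ~0.00015 vs ~0.019
```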