Prediction evaluations may be best when minimally novel
Imagine a prediction pipeline that is resolved with a human/judgmental evaluation. For instance, a group today starts predicting what a trusted judge 10 years from now will say for the question, “How much counterfactual GDP benefit did policy X produce from 2020 to 2030?”
So, there are two stages:
Prediction
Evaluation
One question for the organizer of such a system is how many resources to allocate to the prediction step vs. the evaluation step. Paying for both predictors and evaluators could be expensive, so it’s not clear how to weigh these steps against each other.
I’ve been suspecting that there are ways to be stingy with regard to the evaluators, and I now have a better sense of why that is the case.
Imagine a model where the predictors gradually discover information I_predictors, a subset of I_total, the full ideal information needed to make this estimate. Imagine that they are well calibrated and use the comment sections to share their information as they predict.
Later, the evaluator comes along. Because they can read everything written so far, they start with I_predictors. They can use this to calculate Prediction(I_predictors), although this should already have been estimated by the previous predictors (à la the best aggregate).
At this point the evaluator can choose to seek more information, I_evaluation > I_predictors. However, if they do, the resulting probability distribution would itself be anticipated by Prediction(I_predictors). As far as the predictors are concerned, the expected value of Prediction(I_evaluation) should be the same as Prediction(I_predictors), assuming that Prediction(I_predictors) is calibrated; the difference is that Prediction(I_evaluation) will have more risk/randomness. Risk is generally not a desirable property. I’ve written about similar topics in this post.
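To make this concrete, here is a minimal toy simulation (entirely my own illustration; the coin-flip signal model and all the names in it are assumptions, not part of the argument above). It fixes one realization of I_predictors and samples the extra signals the evaluator might learn; the simulated average of Prediction(I_evaluation) matches Prediction(I_predictors), while its variance is positive, which is exactly the extra risk.

```python
import random
from math import comb

random.seed(0)

# Toy sketch (the signal model and all names here are my own assumptions).
# Outcome: does the sum of N_SIGNALS fair coin flips exceed N_SIGNALS / 2?
# The predictors have seen the first K signals (I_predictors); the evaluator
# could additionally learn the remaining ones (I_evaluation).
N_SIGNALS, K = 10, 6

def calibrated_forecast(observed):
    """P(outcome) given the observed signals, treating the unobserved
    signals as fair coins -- i.e. the calibrated posterior."""
    remaining = N_SIGNALS - len(observed)
    needed = N_SIGNALS / 2 - sum(observed)  # how many more 1s are needed to exceed half
    wins = sum(comb(remaining, k) for k in range(remaining + 1) if k > needed)
    return wins / 2 ** remaining

# Fix one realization of the predictors' information.
i_predictors = [1, 0, 1, 1, 0, 1]
p_predictors = calibrated_forecast(i_predictors)

# From the predictors' point of view, the evaluator's extra signals are still
# random, so simulate the forecast the evaluator *could* end up making.
eval_forecasts = []
for _ in range(100_000):
    i_evaluation = i_predictors + [random.randint(0, 1) for _ in range(N_SIGNALS - K)]
    eval_forecasts.append(calibrated_forecast(i_evaluation))

mean_eval = sum(eval_forecasts) / len(eval_forecasts)
var_eval = sum((x - mean_eval) ** 2 for x in eval_forecasts) / len(eval_forecasts)

print(f"Prediction(I_predictors):                {p_predictors:.3f}")
print(f"E[Prediction(I_evaluation)] (simulated): {mean_eval:.3f}")  # roughly the same
print(f"Var[Prediction(I_evaluation)]:           {var_eval:.3f}")   # > 0: the extra risk
```

The equality of the two means is just the law of total expectation applied to a calibrated forecast; the positive variance is what the predictors would experience as added randomness in their scores.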
Therefore, the predictors should generally prefer Prediction(I_predictors) to Prediction(I_evaluation), as long as everyone’s predictions are properly calibrated. This difference shouldn’t generally change the predictions they make unless a complex or odd scoring rule were used.
Of course, calibration can’t be taken for granted. So pragmatically, the evaluator would likely have to deal with issues of calibration.
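As a rough sketch of what that might look like in practice (the forecast/outcome pairs below are made up, and the binning scheme is just one simple choice), the evaluator could bucket resolved predictions and compare stated probabilities with observed frequencies:

```python
from collections import defaultdict

# Hypothetical (made-up) pairs of (stated probability, resolved outcome).
forecasts = [
    (0.1, 0), (0.15, 0), (0.2, 1), (0.3, 0), (0.35, 0),
    (0.5, 1), (0.55, 0), (0.7, 1), (0.8, 1), (0.9, 1),
]

# Bucket forecasts into 0.1-wide bins and compare the stated probability
# against the observed frequency of "yes" resolutions in each bin.
buckets = defaultdict(list)
for p, outcome in forecasts:
    buckets[round(p, 1)].append(outcome)

for p in sorted(buckets):
    outcomes = buckets[p]
    print(f"forecast ~{p:.1f}: resolved yes {sum(outcomes)}/{len(outcomes)} of the time")
```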
This setup also assumed that maximally useful comments are made available to the evaluator. I think predictors will generally want the evaluator to see much of their information, as it would generally support their positions.
A relaxed version of this would be that the evaluator’s duty is to obtain approximately all of the information that the predictors had access to; gathering more is not necessary.
Note that this model is only interested in the impact of good evaluation on the predictions. Evaluation would also produce “externalities”: information that would be useful in other ways as well. That information isn’t included here, but I’m fine with that. I think we should generally expect predictors to be more cost-effective than evaluators at doing “prediction work” (which is the main reason we separate the two roles anyway!).
TLDR
The role of evaluation could be to ensure that predictions were reasonably calibrated and that the aggregation thus did a decent job. Evaluators shouldn’t have to outperform the aggregate, if doing so would require information beyond what was used in the predictions.