Agreed on all points!
In particular, I don’t have any disagreement with the way the epistemic aggregation is being done; I just think there’s something suboptimal in the way the headline number (in this case, for a count-the-number-of-humans domain) is chosen and reported. And I worry that the median-ing leads to easily misinterpreted data.
For example, if a question asked “How many people are going to die from unaligned AI?”, and the community’s true belief was “40% to be everyone and 60% to be one person”, and that was reported as “the Metaculus community predicts 9,200 people will die from unaligned AI, 10% as many as die in fires per year”, that would...not be a helpful number at all.
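To make the arithmetic concrete, here's a minimal sketch of that hypothetical belief (assuming a world population of roughly 8 billion; the 40%/60% weights and the "one person" outcome are the illustrative numbers above, not anyone's actual forecast). Note that the linear mean, the median, and the log-space (geometric) mean give wildly different headlines:

```python
import math

# Hypothetical bimodal community belief:
# 40% "everyone dies" (~8e9 people), 60% "one person dies".
p_everyone, n_everyone = 0.4, 8e9
p_one, n_one = 0.6, 1.0

# Linear-space mean: the expected number of deaths.
mean = p_everyone * n_everyone + p_one * n_one

# Median: more than half the probability mass sits at 1.
median = n_one

# Log-space mean (geometric mean): what you get if you
# aggregate in log space and exponentiate back.
geo_mean = math.exp(p_everyone * math.log(n_everyone)
                    + p_one * math.log(n_one))

print(f"mean ≈ {mean:.3g}")             # ~3.2 billion
print(f"median = {median:g}")           # 1
print(f"geometric mean ≈ {geo_mean:.3g}")  # ~9,000
```

The log-space mean lands in the thousands, which is the kind of "9,200 deaths" headline that summarizes a 40%-extinction belief about as badly as a number can.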
You’re right that dates have their own nuance—whether it’s AGI or my food delivery, I care about the median arrival a lot more than the mean (but also, a lot about the tails!).
And so, in accordance with the ancient wisdom, I know that there’s something wrong here, and I don’t presume to be able to find the exact right fix. It seems most likely that there will have to be different handling for qualitatively different types of questions—a separation between “uncertainty in linear space, importance in linear space” (ex: Net migration to UK in 2021), “uncertainty in log space, importance in quantiles” (ex: AGI), and “uncertainty in log space, importance in linear space” (ex: Monkeypox). The first two categories are already treated differently, so it seems possible for the third category to be minted as a new species of question.
Alternatively, much of the value could come from reporting means in addition to medians on every log question, so that the predictor and the consumer can each choose the numbers they find most useful to orient towards, and ignore the ones that are nonsensical. This doesn’t really solve the problem of predictor incentives, but at least it makes the implications of their predictions explicit rather than obscured.
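As a toy illustration of why reporting both matters on log questions: for any heavy-tailed belief, the mean and median can differ by an order of magnitude or more, so either one alone misleads some consumer. A sketch with a made-up lognormal forecast (the parameters here are purely illustrative, not any real Metaculus distribution):

```python
import random
import statistics

random.seed(0)

# Hypothetical log-space community forecast, e.g. for a case count:
# log10(outcome) ~ Normal(mu=3, sigma=1), i.e. a median around 1,000
# with a heavy right tail. Parameters are invented for illustration.
samples = [10 ** random.gauss(3, 1) for _ in range(100_000)]

median = statistics.median(samples)  # what typically gets headlined
mean = statistics.fmean(samples)     # what an expected-value consumer wants

print(f"median ≈ {median:,.0f}")
print(f"mean   ≈ {mean:,.0f}")  # many times larger than the median
```

Showing both numbers side by side at least lets the reader notice when the tail is doing most of the work.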