This is good in some ways but also quite misleading. It selects against people who place forecasts on a large number of questions, and against people who forecast on questions that have already been open for a long time and don't have time to update most of those forecasts later.
I’d say it’s a very good way to measure performance within a tournament, but in the broader jungle of questions it misses an awful lot.
E.g. I have predictions on 1,114 questions; the majority were never updated and had negligible effort put into them.
Sometimes, for fun, I used to place my first (and only) forecast on questions that were just about to close. I liked this because it made it easier to compare my performance against the community on distribution questions, since the final summary only shows that comparison for the final snapshot. Of course, doing this earns very few points per question. But when I look at my results on those, I typically slightly outperform the community median.
This isn't captured by my average points per question across all questions, where I underperform (partly because I never updated most of those forecasts, and partly because a lot of them are amusingly obscure questions I put little effort into). That's not to suggest I'm particularly great either (I'm not), but I digress.
If we're trying to predict a forecaster's insight on the next given discrete prediction, a more useful metric would be the forecaster's log score versus the community's log score on the same questions, at the time they placed those forecasts. Naturally this isn't a good way to score tournaments, where people should update often and put high effort into each question. But if we're trying to estimate someone's judgment from the broader jungle of Metaculus questions, it would be much more informative than average points per question.
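To make that concrete, here's a rough sketch of the comparison I have in mind, for binary questions only (the field names and data layout are made up for illustration; Metaculus's actual scoring details differ, and distribution questions would need the full log density rather than a binary log score):

```python
import numpy as np

def log_score(p, outcome):
    """Log score of a binary forecast: ln(p) if the event happened, ln(1-p) otherwise."""
    p = np.clip(p, 1e-6, 1 - 1e-6)  # guard against log(0)
    return np.log(p) if outcome else np.log(1 - p)

def relative_skill(forecasts):
    """
    forecasts: list of dicts with hypothetical keys
      'my_p'        - my probability at the moment I placed the forecast
      'community_p' - the community's probability at that same moment
      'outcome'     - True/False resolution
    Returns my mean log score minus the community's on the same questions,
    evaluated at forecast time. Positive = outperforming the community snapshot.
    """
    diffs = [
        log_score(f['my_p'], f['outcome']) - log_score(f['community_p'], f['outcome'])
        for f in forecasts
    ]
    return float(np.mean(diffs))

# Toy example: one question where I beat the community, one where I didn't.
example = [
    {'my_p': 0.80, 'community_p': 0.60, 'outcome': True},
    {'my_p': 0.30, 'community_p': 0.20, 'outcome': False},
]
print(relative_skill(example))  # > 0 means better judgment than the community at forecast time
```

The point of scoring against the community *at the time of the forecast* is that it doesn't penalise placing one late, low-effort forecast and never updating, the way an average-points metric does.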