The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform
Seeing how the crowd forecast itself routinely performs at a superhuman level, isn’t that an unfairly high bar to clear? This isn’t meant to invalidate the rest of your arguments – the methodological problems you point out are really bad – but before asking whether performance is superhuman, it makes a lot of sense to fully agree on what superhuman performance really is.
(I also note that a high-quality forecasting platform suffers from self-selection by unusually enthusiastic forecasters, raising the bar further. However, I don’t believe this is an actual problem, because if someone claims “performance on par with humans” I would expect that to mean “on par with enthusiastic humans”.)
As I understand it, the Metaculus crowd forecast performs as well as it does (relative to individual predictors) in part because it gives greater weight to more recent predictions. If “superhuman” just means “superhumanly up-to-date on the news”, it’s less impressive for an AI to reach that level if it’s also up-to-date on the news when its predictions are collected. (But to be confident that this point applies, I’d have to know the details of the research better.)
I agree it’s a high bar, but note that
1. a few particularly enthusiastic (and smart) humans still perform at roughly this level (depending on how you measure performance), so you wouldn’t want the bar to be much lower, and
2. we only acknowledged that this is a fairly reasonable definition of superhuman performance; it’s the authors of these papers who claimed that their models were (roughly) on par with, or better than, the crowd forecast.
We made a deliberate choice not to get too deep into the details of what constitutes human-level/superhuman forecasting ability. We have a lot of opinions on this as well, but it is a topic for another post, so as not to derail the discussion of what we think matters most here.
I think it is fair to say that Metaculus’ crowd forecast is not what one would naively think of as a crowd average: the recency weighting does a lot of work. So a general claim that an individual AI forecaster (at, say, the 80th percentile of ability) is better than the human crowd is reasonable, unless it is made specifically against a Metaculus-type weighted forecast.
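To make the distinction concrete, here is a minimal sketch of how a recency-weighted aggregate can differ from a naive crowd average. The numbers, the exponential weighting, and the seven-day half-life are all invented for illustration; this is not Metaculus’ actual aggregation formula.

```python
import numpy as np

# Hypothetical individual forecasts for one binary question (probabilities),
# listed oldest to newest. All numbers are made up for illustration.
probs = np.array([0.30, 0.35, 0.40, 0.60, 0.65])
# Age of each forecast in days (older forecasts have larger values).
age_days = np.array([30, 20, 10, 2, 1])

# Naive crowd average: every forecast counts equally.
naive_avg = probs.mean()

# Recency-weighted average: newer forecasts get exponentially more weight.
# The half-life and exponential form are assumptions for this sketch only.
half_life = 7.0
weights = 0.5 ** (age_days / half_life)
recency_weighted = np.average(probs, weights=weights)

print(f"naive average:            {naive_avg:.2f}")        # 0.46
print(f"recency-weighted average: {recency_weighted:.2f}")  # ~0.57, tracks the latest forecasts
```

If the most recent forecasters have simply incorporated fresh news, the weighted aggregate will look much sharper than the naive average, which is the sense in which an AI that is equally up-to-date at prediction time is clearing a somewhat different bar.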