Contra papers claiming superhuman AI forecasting

[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]

Widespread, misleading claims about AI forecasting

Recently we have seen a number of papers – (Schoenegger et al., 2024, Halawi et al., 2024, Phan et al., 2024, Hsieh et al., 2024) – with claims that boil down to “we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance”.

These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse:

Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.:

A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like

  • “This is something that humans are notably terrible at, even if they’re paid to do it. No surprise that LLMs can match us.”

  • “+1 The aggregate human success rate is a pretty low bar”

A Twitter thread on LLMs Are Superhuman Forecasters by Phan et al., claiming that “AI […] can predict the future at a superhuman level”, had more than half a million views within two days of being published.

The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, makes AI forecasting a uniquely misunderstood area of AI progress. And it’s one that matters.

What does human-level or superhuman forecasting mean?

“Human-level” or “superhuman” is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans.

One reasonable and practical definition of a superhuman AI forecaster is

The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform.[1]

(For a human-level forecaster, just replace “outperform” with “perform on par with”.)

Red flags for claims to (super)human AI forecasting accuracy

Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including:

  1. Failing to find up-to-date information on the questions. It’s inconceivable on most questions that forecasts can be good without basic information.

    • Imagine trying to forecast the US presidential election without knowing that Biden dropped out.

  2. Drawing on up-to-date, but low-quality information. Ample experience shows low quality information confuses LLMs even more than it confuses humans.

  3. Lack of high-quality quantitative reasoning. For a decent number of questions on Metaculus, good forecasts can be “vibed” by skilled humans and perhaps LLMs. But for many questions, simple calculations are likely essential. Human performance shows systematic accuracy nearly always requires simple models such as base rates, time-series extrapolations, and domain-specific numbers.

    • Imagine forecasting stock prices without having, and using, historical volatility.

  4. Retrospective, rather than prospective, forecasting (e.g. forecasting questions that have already resolved). The risk for leakage of data about the present into the forecast, either in the LLMs or in the information used in the forecast, is extremely hard to stamp out.

Points 1 and 2 could also be summarised as “not being good (enough) at information retrieval (IR)”. We believe that “being good at IR” is both

  • necessary for being good at forecasting (and thus)

  • easier than being good at forecasting.

So if an agent fails at the IR stage, even the smartest and most rational entity will struggle to turn the retrieved material into a good forecast. This is basically just a roundabout way of saying “garbage in, garbage out” (GIGO).

A similar argument can be made for quantitative reasoning being important.

In the following, we go through issues with the papers in detail.

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (Schoenegger et al., 2024)

A quick glance over the paper shows a couple of suspicious points:

  • The architectures tested have virtually no information retrieval (IR). More precisely, 9 out of 12 LLMs (over whose predictions they take the median to obtain the final forecast) have no IR whatsoever and the 3 remaining ones have ChatGPT-like access to the internet when generating their forecast in response to a single, static prompt. (When we tried their prompt in ChatGPT with a question like “Will Israel and Hamas make peace before the end of the year?”, GPT-4o didn’t even check whether they have already made peace.)
    Hence the aggregate forecast will usually not be aware of any recent developments that aren’t already in the LLMs’ memories.

  • The authors only looked at n=31 questions. But you need quite a large number of forecasts/resolved questions to accurately determine whether forecaster A is better than forecaster B (see e.g. this post).
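To get a feel for why 31 questions is too few, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the uniform spread of true probabilities, the Gaussian noise levels (chosen so that forecaster A is truly better than B by roughly .02 in expected Brier score), and the function name `flip_rate`. It is not a reconstruction of any paper's setup.

```python
import random


def flip_rate(n_questions, n_tournaments=300, seed=0):
    """Fraction of simulated tournaments in which the truly worse forecaster (B)
    ends up with the lower (better) mean Brier score than forecaster A."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(n_tournaments):
        total_diff = 0.0
        for _ in range(n_questions):
            p = rng.uniform(0.05, 0.95)            # assumed true probability
            outcome = 1 if rng.random() < p else 0
            # A sees the true probability with small noise, B with larger noise.
            a = min(max(p + rng.gauss(0, 0.05), 0.01), 0.99)
            b = min(max(p + rng.gauss(0, 0.15), 0.01), 0.99)
            total_diff += (b - outcome) ** 2 - (a - outcome) ** 2
        if total_diff < 0:                         # B scored better by luck
            flips += 1
    return flips / n_tournaments


small = flip_rate(31)      # tournament size used in the paper
large = flip_rate(1000)    # a much larger, hypothetical tournament
print(f"worse forecaster wins with n=31:   {small:.2%}")
print(f"worse forecaster wins with n=1000: {large:.2%}")
```

Under these assumptions, the worse forecaster comes out ahead in a sizeable share of 31-question tournaments, and almost never once there are on the order of a thousand questions.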

And indeed, upon a closer look, one sees that the paper’s titular claim, reiterated in the abstract (“the LLM crowd… is not statistically different from the human crowd”) is not at all supported by the study: In the relevant non-preregistered part of the paper, they introduce a notion of equivalence: Two sets of forecasters are equally good if their Brier scores differ by no more than 0.081.

A difference in Brier scores of ≤.081 may sound small, but what does it mean?

  • The human aggregate in the study (avg. Brier of .19) would, according to this definition, count as equivalent to a forecaster who has a Brier score of ≤ .271 (= .19 + .081). In their study, the human aggregate would e.g. count as equivalent to a forecaster who always predicts 50% (resulting in a Brier score of .25).

    • In particular, this notion of equivalence is incompatible with their pre-registered result refuting their Null hypothesis 1, Study 1 (p3).

  • Being omniscient (i.e. knowing all the answers in advance, getting a Brier score of 0) would be equivalent to predicting ≈72% for every true and ≈28% for every false outcome (getting a Brier score of .081).

  • Tetlock’s claims about Superforecasters would be invalidated because Superforecaster aggregates (avg. Brier of .146) would be equivalent to aggregates from all GJO participants (avg. Brier of .195).
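The arithmetic behind these bullets can be checked with a short sketch. The margin (.081) and the average Brier scores (.19, .146, .195) are the figures quoted above; the toy outcome list and the helper names `brier` and `equivalent` are assumptions for illustration only.

```python
def brier(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def equivalent(score_a, score_b, margin=0.081):
    """Equivalence in the sense described above: Brier scores within the margin."""
    return abs(score_a - score_b) <= margin


# Toy outcomes (illustrative only, not the study's questions).
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

always_50 = brier([0.5] * len(outcomes), outcomes)
hedged = brier([0.72 if o == 1 else 0.28 for o in outcomes], outcomes)
omniscient = brier([float(o) for o in outcomes], outcomes)

print("always 50%:", always_50)
print("72% / 28% :", round(hedged, 4))
print("omniscient:", omniscient)

# Pairs that the .081 margin would declare "equivalent":
print(equivalent(0.19, always_50))    # human aggregate vs. always predicting 50%
print(equivalent(omniscient, hedged)) # omniscience vs. 72% / 28% hedging
print(equivalent(0.146, 0.195))       # Superforecasters vs. all GJO participants
```

Always predicting 50% scores .25 regardless of the outcomes, so the first comparison does not depend on the toy data.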

Approaching Human-Level Forecasting with Language Models (Halawi et al., 2024)

This paper is of high quality and by far the best paper out of these four. The methodology looks serious and they implement a non-trivial model with information retrieval (IR).

Our main contention is that the title and conclusions risk leaving the reader with a misleading impression. The abstract reads:

On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it.

In the paper, they (correctly) state that a difference of .02 in Brier score is a large margin:

Only the GPT-4 and Claude-2 series beat the unskilled baseline by a large margin (> .02)

However, later on they summarize their main findings

As the main result, our averaged Brier score is .179, while the crowd achieves .149, resulting in a difference of .03.

So the main claim might as well read “There is still a large margin between human-level forecasting and forecasting with LLMs.” These are the main results (note that accuracy, in contrast to the Brier score, is not a proper scoring rule):

Overall, differences are substantial. This result should not be very surprising since IR is genuinely hard and the example they show on page 25 just isn’t there yet: It just ends up finding links to Youtube and random users’ Tweets.
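The parenthetical remark about accuracy not being a proper scoring rule can be made concrete with a small sketch. It assumes a single binary question whose true probability is 70% (a made-up number) and that accuracy is computed by thresholding the forecast at 50%; the function names are hypothetical.

```python
def expected_brier(report, p_true):
    """Expected Brier score of reporting `report` when the event has probability p_true."""
    return p_true * (report - 1) ** 2 + (1 - p_true) * report ** 2


def expected_accuracy(report, p_true):
    """Expected accuracy when the reported probability is thresholded at 50%."""
    if report > 0.5:
        return p_true
    if report < 0.5:
        return 1 - p_true
    return 0.5  # a report of exactly 50% is treated as a coin flip


P_TRUE = 0.7  # assumed true probability, for illustration
reports = [i / 100 for i in range(1, 100)]

# The Brier score is minimized by reporting the true probability ...
best_report = min(reports, key=lambda r: expected_brier(r, P_TRUE))
print("report minimizing expected Brier:", best_report)

# ... while accuracy cannot tell an honest 70% from a wildly overconfident 99%.
print("expected accuracy at 0.51:", expected_accuracy(0.51, P_TRUE))
print("expected accuracy at 0.70:", expected_accuracy(0.70, P_TRUE))
print("expected accuracy at 0.99:", expected_accuracy(0.99, P_TRUE))
```

Because any report above 50% earns the same expected accuracy, accuracy gives no incentive for calibrated probabilities, which is why the Brier scores are the figures to compare.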

Reasoning and Tools for Human-Level Forecasting (Hsieh et al., 2024)

The standard for “human-level forecasting” in this paper is quite low. To create their dataset, the authors gathered questions from Manifold on April 15, 2024, and filtered for those resolving within two weeks. It’s likely that this yielded many low-volume markets, making the baseline rather weak. Also, there’s evidence to suggest that Manifold in general is not the strongest human forecasting baseline: In one investigation from 2023, Metaculus noticeably outperformed Manifold in a direct comparison on the same set of questions.

And there’s a further methodological issue. The authors compare Manifold predictions from April 15, 2024 to LLM predictions from an unspecified later date, when more information was available. They try to mitigate this using Google’s date range feature, but this feature is known to be unreliable.

Looking at a sample reasoning trace (page 7ff) also raises suspicions. It looks like their agent tries various approaches: base rates, numerical simulations based on historical volatility, and judgemental adjustments. But both the base rate and the numerical simulations are completely hallucinated, since their IR did not manage to find relevant data. (As pointed out above, good IR is a genuinely hard problem!)

It seems unlikely that a system relying on hallucinated base rates and numerical simulations goes all the way to outperforming (half-decent) human forecasters in any meaningful way.

LLMs Are Superhuman Forecasters (Phan et al., 2024)

Unlike Halawi et al. (2024) and Hsieh et al. (2024), the authors implicitly claim that no agent is needed for superhuman performance. Instead, two GPT-4o prompts with the most basic IR suffice.

There is a lot of pushback online, e.g. in the comment section of a related market (Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?) and on LessWrong. The main problems seem to be as follows:

Their results don’t seem to replicate on another set of questions (per Halawi). There is also some empirical evidence that the system doesn’t give good forecasts.

There is also data contamination:

In addition, they only manage to beat the human crowd after applying some post-processing:

Maybe a fair criterion for judging “superhuman performance” could be “would you also beat the crowd if you applied the same post-processing to the human forecasts?”
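As a minimal sketch of that criterion: apply one and the same transform to both the AI forecasts and the crowd forecasts before scoring. The `extremize` transform below is only a stand-in (we make no claim that it is the post-processing used in the paper), and the forecasts and outcomes are made-up toy numbers.

```python
def brier(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def extremize(p, a=2.0):
    """Hypothetical post-processing step: push a probability away from 50%."""
    return p ** a / (p ** a + (1 - p) ** a)


# Toy data, invented purely for illustration.
outcomes = [1, 1, 0, 1, 0, 0]
crowd = [0.80, 0.70, 0.20, 0.75, 0.30, 0.25]
ai = [0.75, 0.70, 0.25, 0.70, 0.30, 0.30]

crowd_raw = brier(crowd, outcomes)
ai_post = brier([extremize(p) for p in ai], outcomes)
crowd_post = brier([extremize(p) for p in crowd], outcomes)

# Unfair comparison: post-processed AI vs. raw crowd.
print("AI (post-processed) beats raw crowd:", ai_post < crowd_raw)
# Fair comparison: the same post-processing applied to both sides.
print("AI (post-processed) beats post-processed crowd:", ai_post < crowd_post)
```

In this toy example the AI only looks better than the crowd when the crowd is denied the same transform; the point is the shape of the comparison, not the specific numbers.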

Takeaways

  • Basic information retrieval is a hard problem. (See also our paper here.)

  • Advanced information retrieval, e.g. getting LLM-based systems to find high-quality relevant data without being thrown off by all the low-quality information is a hard problem.

  • Getting LLM-based systems to work out simple quantitative reasoning chains (e.g. base rates), instead of just hallucinating them, is genuinely hard.

All of the above appear to require significant engineering effort and extensive LLM scaffolding.

Simply throwing a ReAct agent (or another scaffolding method) at the problem and leaving the LLM to fend for itself is not enough with current LLMs.

Even a well-engineered effort, such as that from Halawi et al., produces chains of reasoning that often lag behind human forecasters, and fall far short of expert forecasting performance.

So how good are AI forecasters?

This remains to be seen. But taking it all together (these papers, especially Halawi et al.; FutureSearch’s preliminary, but not paper-quality rigorous, evals; the current Metaculus benchmarking tournament; and anecdotal evidence), we are fairly confident that

  • Today’s autonomous AI forecasting can be better than average, or even experienced, human forecasters,

  • but it’s very unlikely that any autonomous AI forecaster yet built is close to the accuracy of a top 2% Metaculus forecaster, or the crowd.

References

Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024, February 28). Approaching Human-Level Forecasting with Language Models. arXiv. https://arxiv.org/pdf/2402.18563

Hsieh, E., Fu, P., & Chen, J. (2024, August 21). Reasoning and Tools for Human-Level Forecasting. arXiv. https://www.arxiv.org/pdf/2408.12036

Phan, L., Khoja, A., Mazeika, M., & Hendrycks, D. (2024, September). LLMs Are Superhuman Forecasters. https://drive.google.com/file/d/1Tc_xY1NM-US4mZ4OpzxrpTudyo1W4KsE/view

Schoenegger, P., Park, P., Tuminauskaite, I., & Tetlock, P. (2024, July 22). Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy. arXiv. https://arxiv.org/pdf/2402.19379

Edited Sept 12, 2024 to remove a claim that Phan et al. compared their results to the average of five random forecasts rather than the Metaculus community prediction.

Edited Sept 16, 2024 to clarify that Schoenegger et al.’s aggregate forecast will usually have no IR, as it is the median over 12 models, 9 of which do not have access to the internet, instead of categorically ruling out IR.

  1. ^

    You could of course be even stricter than that, requiring forecasters to consistently beat any human or combination of humans. But that’s hard to measure so we think what we proposed is a reasonable definition. You could also include financial markets. But traders already use a lot of computers and people who can reliably beat the markets usually have better things to do than writing academic papers…