Here is a related paper on “how good are language models at predictions”, also testing the abilities of GPT-4: Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament.

Portion of the abstract:

To empirically test this ability, we enrolled OpenAI’s state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4’s probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4’s forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question.

From the paper:

These data indicate that in 18 out of 23 questions, the median human-crowd forecasts were directionally closer to the truth than GPT-4’s predictions,

[...]

We observe an average Brier score for GPT-4’s predictions of B = .20 (SD = .18), while the human forecaster average Brier score was B = .07 (SD = .08).

They don’t cite the de-calibration result from the GPT-4 paper, but the distribution of GPT-4’s ratings here looks like it’s been tuned to be mealy-mouthed: humped at 60%, so it agrees with whatever you say but then can’t even do so enthusiastically: https://arxiv.org/pdf/2310.13014.pdf#page=6
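For reference on the quoted numbers: the Brier score for binary forecasts is just the mean squared difference between the forecast probability and the 0/1 outcome, so the no-information strategy of always forecasting 50% scores exactly 0.25. A minimal sketch (the example forecasts are made up for illustration, not from the paper’s data):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and 0/1 outcomes.

    Lower is better; a perfect forecaster scores 0.0, and always
    forecasting 0.5 scores exactly 0.25 regardless of the outcomes.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# The no-information baseline: every (0.5 - outcome)^2 term is 0.25.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```

This is why the paper’s finding reads as damning: GPT-4’s B = .20 sits close to the 0.25 chance baseline, while the human crowd’s B = .07 is far below it.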