Congrats on the excellent work! I’ve been following the LLM forecasting space for a while and your results are really pushing the frontier.
Some questions and comments:
AI underconfidence: The AI looks underconfident <10% and >90%. This is kind of apparent from the calibration curves in figure 3b (especially) and 3c (less so), though I’m not sure about this because the figures don’t have confidence intervals. However, table 3 (AI ensemble outperforms the crowd when the crowd is uncertain but the crowd outperforms the AI ensemble overall) and figure 4c (AI ensemble outperforms the crowd early on by a small margin but the crowd outperforms the AI ensemble near question close) seem to point in the same direction. My hypothesis is that this is explained (at least in part) by your use of trimmed mean to aggregate forecasts from individual models. Have you tried extremizing instead?
Performance over time: My understanding is that the AI’s Brier score is an unweighted average of the five forecasts corresponding to different retrieval times. However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up. Given your retrieval schedule, wouldn’t your average put comparatively less weight on the AI’s final forecast? This may underestimate its performance (since the last forecast should be the best one).
Relatedly, have you tried other retrieval schedules and if so did they affect the results?
Also, if the AI’s Brier score is an unweighted average across retrieval times, then I’m confused about an apparent mismatch between table 3 and figure 4c. Table 3 says the AI’s average Brier score across all questions and retrieval times is .179, but in figure 4c the AI’s average Brier score across all questions is <.161 (roughly) at all retrieval times. So, if you average the datapoints in figure 4c you should get a number that’s <.161, not .179. Am I missing something?
Using log scores: This has already been addressed in other comments, but I’d be curious to see if humans still outperform AIs when using log scores.
Estimating standard errors: You note that your standard error estimates are likely to underestimate the true errors because your data is a time series and thus not iid. Do you think this matters in practice or is the underestimate likely to be small? Do you have any thoughts on how to estimate errors more accurately?
Great questions, and thanks for the helpful comments!
underconfidence issues
We have not tried explicit extremizing. However, in the study where we average our system's prediction with the community crowd forecast, the combination achieves a better Brier score than either alone. In those <10% cases, averaging with the crowd effectively does the extremizing for us.
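For reference, here is a minimal sketch of what explicit extremizing could look like on top of trimmed-mean aggregation (we have not tested this); the individual forecasts and the exponent d below are made up for illustration:

```python
import numpy as np
from scipy.stats import trim_mean

def extremize(p, d=2.5):
    """Push an aggregated probability away from 0.5 via p^d / (p^d + (1-p)^d).
    The exponent d is a hypothetical tuning parameter; d = 1 is a no-op."""
    return p ** d / (p ** d + (1 - p) ** d)

# Made-up individual model forecasts for a binary question.
probs = np.array([0.04, 0.06, 0.08, 0.10, 0.30])

aggregated = trim_mean(probs, proportiontocut=0.2)  # trimmed-mean aggregation
print(aggregated, extremize(aggregated))            # extremizing pulls 0.08 toward 0
```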
However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up
We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and the human crowd should be a bit smaller, for the reason you describe.
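To make the difference concrete, here is a minimal sketch (with made-up retrieval dates and forecasts, not our actual schedule) of an unweighted average over retrieval dates versus a duration-weighted average of the kind you describe for INFER/Metaculus:

```python
import numpy as np

# Made-up example: one question, open on day 0, resolving YES (outcome = 1) on day 100,
# with forecasts retrieved on five dates (days since open).
retrieval_days = np.array([2, 10, 25, 45, 70])
forecasts = np.array([0.40, 0.50, 0.60, 0.75, 0.90])
outcome, close_day = 1, 100

brier = (forecasts - outcome) ** 2

# Unweighted average over retrieval dates (what our evaluation does).
unweighted = brier.mean()

# Duration-weighted average: each forecast is scored for as long as it "stands",
# i.e. until the next retrieval date (or question close).
durations = np.diff(np.append(retrieval_days, close_day))
time_weighted = np.average(brier, weights=durations)

print(unweighted, time_weighted)  # with this made-up schedule, later (better) forecasts get more weight
```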
Relatedly, have you tried other retrieval schedules and if so did they affect the results
No, we have not tried other schedules. One alternative is to sample k random or uniformly spaced retrieval dates within [open, resolve]. Unfortunately, this is not entirely kosher, since it leaks the resolution date, which, as we argue in the paper, correlates with the label.
figure 4c
This is on the validation set; notice that the figure caption begins with "Figure 4: System strengths. Evaluating on the validation set, we note …". The Table 3 numbers are not computed on the validation set, so the two averages are not directly comparable.
log score
We will update the paper soon to include the log score.
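In the meantime, for reference: for a binary forecast p of outcome y ∈ {0, 1}, the Brier score is (p − y)² and the log score is the negative log probability assigned to the realized outcome. A quick sketch of why the rankings could differ, since the log score punishes confident misses much more harshly:

```python
import numpy as np

def brier(p, y):
    """Brier score: squared error of the forecast (lower is better)."""
    return (p - y) ** 2

def log_score(p, y, eps=1e-12):
    """Negative log likelihood of the outcome (lower is better)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident miss: forecasting 0.99 for an event that does not happen.
print(brier(0.99, 0), log_score(0.99, 0))  # ~0.98 vs ~4.6
```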
standard error in time series
See here for some alternatives in time series modeling.
I don’t know what the perfect choice is for judgmental forecasting, though; I’m not sure it has been studied at all (probably an open question). Generally, the keywords to Google are “autocorrelation standard error” and “standard error in time series”.
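As one concrete (but by no means canonical) option, here is a sketch of a circular block bootstrap for the standard error of a mean over a time-ordered series of per-question score differences; the block length, number of resamples, and simulated data are all placeholders:

```python
import numpy as np

def block_bootstrap_se(x, block_len=10, n_boot=2000, seed=0):
    """Standard error of the mean of a possibly autocorrelated series,
    estimated with a circular block bootstrap (blocks wrap around the end)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)[None, :]) % n
        means[b] = x[idx.ravel()[:n]].mean()
    return means.std(ddof=1)

# Placeholder data standing in for per-question Brier score differences (AI - crowd),
# ordered by resolution date so that neighbouring values may be correlated.
rng = np.random.default_rng(1)
diffs = 0.02 + 0.05 * rng.standard_normal(300)

naive_se = diffs.std(ddof=1) / np.sqrt(len(diffs))  # iid assumption
print(naive_se, block_bootstrap_se(diffs))
```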