Great questions, and thanks for the helpful comments!
underconfidence issues
We have not tried explicit extremizing. However, in the study where we average our system’s prediction with the community crowd, we find results better than either alone (under Brier score). This averaging effectively performs the extremizing in those <10% cases.
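To make the averaging concrete, here is a minimal sketch (in Python, with made-up probabilities and outcomes, not our actual evaluation code) of combining the two forecasts with an unweighted average and scoring all three with the Brier score:

```python
# Minimal sketch, not the paper's code: average the system's probability with
# the community crowd and compare Brier scores. All numbers are hypothetical.
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return np.mean((p - y) ** 2)

p_system = np.array([0.62, 0.15, 0.80, 0.45])   # our system (hypothetical)
p_crowd  = np.array([0.70, 0.05, 0.90, 0.30])   # community crowd (hypothetical)
y        = np.array([1,    0,    1,    0])      # resolved outcomes (hypothetical)

p_ensemble = 0.5 * (p_system + p_crowd)          # simple unweighted average

for name, p in [("system", p_system), ("crowd", p_crowd), ("ensemble", p_ensemble)]:
    print(f"{name:8s} Brier = {brier_score(p, y):.4f}")
```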
However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up
We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and the human crowd should be a bit smaller, for the reason you describe.
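For illustration, a small sketch (assumptions only, not our evaluation pipeline) of what such a duration-weighted average could look like, where each retrieval date’s Brier score is weighted by how long that forecast would remain in effect:

```python
# Sketch under assumptions, not our evaluation code: weight each retrieval
# date's Brier score by the number of days until the next retrieval date
# (or resolution, for the last one). All numbers are hypothetical.
import numpy as np

def time_weighted_brier(briers, retrieval_days, resolve_day):
    """briers[i] is the Brier score at retrieval_days[i]; the weight is the
    duration for which that forecast stays active."""
    bounds = list(retrieval_days) + [resolve_day]
    weights = np.diff(bounds)                      # days each forecast is "up"
    return np.average(briers, weights=weights)

briers = np.array([0.25, 0.22, 0.18, 0.12, 0.08])  # hypothetical scores
retrieval_days = np.array([0, 20, 40, 60, 80])     # hypothetical schedule
print(time_weighted_brier(briers, retrieval_days, resolve_day=100))
print(np.mean(briers))                             # our current unweighted average
```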
Relatedly, have you tried other retrieval schedules and if so did they affect the results
No, we have not tried other schedules. One alternative would be to sample k random or uniformly spaced retrieval dates within [open, resolve]. Unfortunately, this is not quite sound, as it leaks the resolution date, which, as we argue in the paper, correlates with the label.
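A sketch of that alternative schedule (a hypothetical helper, not in our codebase) makes the leakage visible: both variants need `resolve_day` as an input, so the schedule itself would reveal the resolution date at retrieval time.

```python
# Hypothetical sketch of the alternative retrieval schedule described above.
import numpy as np

def retrieval_schedule(open_day, resolve_day, k, random=False, rng=None):
    """Return k retrieval dates strictly inside [open_day, resolve_day]."""
    if random:
        if rng is None:
            rng = np.random.default_rng(0)
        return np.sort(rng.uniform(open_day, resolve_day, size=k))
    # Uniformly spaced interior points (endpoints excluded).
    return np.linspace(open_day, resolve_day, num=k + 2)[1:-1]

print(retrieval_schedule(0, 100, k=5))               # uniformly spaced
print(retrieval_schedule(0, 100, k=5, random=True))  # random
```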
figure 4c
This is on the validation set. Note that the figure caption begins with “Figure 4: System strengths. Evaluating on the validation set, we note”
log score
We will update the paper soon to include the log score.
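For reference, the (binary) log score we plan to report is the negative log-likelihood of the resolved outcome under the forecast; a minimal sketch on hypothetical numbers:

```python
# Sketch of the binary log score (lower is better), on hypothetical numbers.
import numpy as np

def log_score(p, y, eps=1e-12):
    """Mean negative log-likelihood of binary outcomes y under forecasts p;
    eps clips probabilities away from exactly 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.62, 0.15, 0.80, 0.45])  # hypothetical forecasts
y = np.array([1,    0,    1,    0])     # resolved outcomes
print(log_score(p, y))
```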
standard error in time series
See here for some alternatives from time series modeling.
I don’t know what the best choice is for judgemental forecasting, though; I am not sure it has been studied at all (probably an open question). Generally, the keywords to search for are “autocorrelation standard error” and “standard error in time series”.
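As one example of what “autocorrelation standard error” refers to (purely illustrative, not a recommendation of the right choice for judgemental forecasting), here is a from-scratch Newey–West (HAC) standard error for a sample mean, compared against the naive i.i.d. standard error on hypothetical autocorrelated data:

```python
# Sketch: Newey-West (HAC) standard error of a sample mean with Bartlett
# weights, vs. the naive i.i.d. standard error. Data are hypothetical.
import numpy as np

def newey_west_se(x, max_lag):
    """HAC standard error of the mean of x, down-weighting autocovariances
    with the Bartlett kernel up to max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    long_run_var = np.sum(d * d) / n               # gamma(0)
    for k in range(1, min(max_lag, n - 1) + 1):
        gamma_k = np.sum(d[k:] * d[:-k]) / n       # lag-k autocovariance
        long_run_var += 2 * (1 - k / (max_lag + 1)) * gamma_k
    return np.sqrt(long_run_var / n)

# Hypothetical positively autocorrelated series (moving average of noise).
rng = np.random.default_rng(0)
x = np.convolve(rng.normal(size=400), np.ones(5) / 5, mode="valid")
print("naive SE:     ", x.std(ddof=1) / np.sqrt(len(x)))
print("Newey-West SE:", newey_west_se(x, max_lag=10))
```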