I think human forecasters collaborating with their AI counterparts (in an assistance/debate setup) is a super interesting future direction. I imagine the strongest system we could build today would be of this sort. This related work explored this direction with some positive results.
is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration?
Definitely both, but more of it comes from the fact that the models don’t like to output extreme values (e.g., <5%), even when the evidence supports them. This doesn’t necessarily hurt calibration, though, since calibration only measures the error within each bin of predicted probabilities.
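For concreteness, here is a minimal sketch (my own code, not from the paper) of the kind of binned calibration error we have in mind; the function name and the equal-width binning are my own choices. A forecaster who never goes below, say, 0.05 can still score well on this, as long as within-bin frequencies match the forecasts:

```python
import numpy as np

def binned_calibration_error(probs, outcomes, n_bins=10):
    """Expected calibration error: |mean forecast - mean outcome| per bin,
    weighted by the fraction of forecasts falling in that bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include 1.0 in the last bin
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece
```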
This will result in a lot of very low probability forecasts, since it’s likely that only A or B occurs, especially closer to the resolution date.
Yes, so we didn’t use all of the multiple-choice questions, only those that are already split into binary questions by the platforms. For example, if you query the Metaculus API, some multiple-choice questions are broken down into binary subquestions (each with their own community predictions, etc.). Our dataset is not dominated by such multiple-choice-turned-binary questions.
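To illustrate what such a split looks like, here is a hypothetical sketch; the field names (`title`, `options`, `community_prediction`) are made up for the example and are not the actual Metaculus API schema:

```python
def split_multiple_choice(question):
    """Turn one multiple-choice question (a plain dict) into one binary
    subquestion per option, each carrying that option's probability."""
    return [
        {
            "title": f"{question['title']}: will the outcome be '{opt}'?",
            "community_prediction": p,
        }
        for opt, p in zip(question["options"], question["community_prediction"])
    ]

# Example: a 3-option question becomes three binary questions,
# two of which carry fairly low probabilities.
q = {"title": "Who wins the election?",
     "options": ["A", "B", "C"],
     "community_prediction": [0.55, 0.35, 0.10]}
for sub in split_multiple_choice(q):
    print(sub)
```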
Does your system obey the law of total probability?
No, and we didn’t try very hard to fix this. Similarly, if you ask the model the same binary question phrased in the reverse way, the two answers generally do not sum to 1. I think future systems should try to overcome this issue by enforcing the constraints in some way.
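As one illustration of what "enforcing the constraints" could mean, here is a simple post-hoc normalization sketch; `query_model` is a hypothetical callable returning P(yes) for a binary question, and this is not the method used in the paper:

```python
def consistent_forecast(query_model, question, negated_question):
    """Query the model on a question and on its negation, then renormalize so
    the two probabilities sum to 1."""
    p_yes = query_model(question)          # model's P(event happens)
    p_no = query_model(negated_question)   # model's P(event does not happen), asked directly
    total = p_yes + p_no
    if total == 0:
        return 0.5  # no signal either way; fall back to an uninformative prior
    return p_yes / total

# Example with a stubbed, inconsistent model (0.7 + 0.5 != 1).
stub = lambda q: {"Will X happen by 2025?": 0.7,
                  "Will X fail to happen by 2025?": 0.5}[q]
print(consistent_forecast(stub, "Will X happen by 2025?", "Will X fail to happen by 2025?"))
# -> ~0.583, and the implied P(not X) is now 1 - 0.583
```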
I’m left wondering what caused the human and LM forecasts to differ in accuracy.
By accuracy, we mean 0-1 error: you round the probabilistic forecast to whichever of 0 or 1 is nearer, and measure the 0-1 loss. This means that as long as you are directionally correct, you will have good accuracy. (This is not a standard metric, but we chose to report it mostly to compare with prior work.) So this kind of hedging behavior doesn’t hurt accuracy, in general.
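A minimal sketch of this metric (my own code, with an arbitrary tie-break at 0.5): a hedged forecaster and a confident one get the same accuracy whenever both are directionally correct.

```python
import numpy as np

def zero_one_accuracy(probs, outcomes):
    """Round each probabilistic forecast to the nearest of 0/1 (0.5 rounds up here)
    and compare with the resolved outcome."""
    preds = (np.asarray(probs, dtype=float) >= 0.5).astype(int)
    return (preds == np.asarray(outcomes)).mean()

# 0.6/0.4 and 0.95/0.05 score identically on questions resolving yes/no.
print(zero_one_accuracy([0.6, 0.4], [1, 0]))    # 1.0
print(zero_one_accuracy([0.95, 0.05], [1, 0]))  # 1.0
```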
The McCarthy example [...] This is a failure in reasoning, not calibration, IMO.
This is a good point! We’ll add a bit more on how to interpret these qualitative examples. To be fair, these are hand-picked and I would caution against drawing strong conclusions from them.
We will be updating the paper with log scores.
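For reference, a quick sketch of the (mean) log score, which, unlike 0-1 accuracy, does penalize hedged forecasts even when they are directionally correct; the clipping constant is my own choice:

```python
import numpy as np

def log_score(probs, outcomes):
    """Mean log score (higher is better): log p if the event happened,
    log(1 - p) otherwise."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(outcomes, dtype=float)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_score([0.6], [1]))   # ~ -0.51
print(log_score([0.95], [1]))  # ~ -0.05, much better despite equal accuracy
```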