Interesting work, congrats on achieving human-ish performance!
I expect your model would look relatively better under other proper scoring rules. For example, logarithmic scoring would punish the human crowd for giving <1% probabilities to events that do sometimes happen. Under the Brier score, the worst possible score is either a 1 or a 2 depending on how it’s formulated (from skimming your paper, it looks like 1 to me), whereas under a logarithmic score such forecasts would be severely punished. I don’t think this is something you should lead with, since Brier scores are the more common scoring rule in the literature, but it seems like an easy win and would highlight the possible benefits of the model’s relatively conservative forecasting.
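To make the difference concrete, here is a quick sketch (with made-up probabilities, not numbers from your paper) of how the two rules treat an overconfident forecast on an event that does occur:

```python
import numpy as np

def brier(p, outcome):
    # Binary Brier score in the 0-1 formulation: (p - outcome)^2, capped at 1.
    return (p - outcome) ** 2

def log_score(p, outcome):
    # Negative log likelihood of the realized outcome; unbounded as p -> 0 or 1.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))

# A crowd at 1% vs. a conservative model at 20%, when the event happens anyway:
for p in (0.01, 0.20):
    print(f"p={p:.2f}  Brier={brier(p, 1):.3f}  log={log_score(p, 1):.3f}")
# Brier: 0.980 vs. 0.640 (a ~1.5x gap); log: 4.605 vs. 1.609 (a ~2.9x gap),
# so the conservative forecaster gains much more ground under the log score.
```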
I’m curious how a more sophisticated human-machine hybrid would perform with these much stronger machine models; I expect quite well. I did some research on human-machine hybrids before and found modest improvements from incorporating machine forecasts (e.g. chapter 5, section 5.2.4 of my dissertation Metacognitively Wise Crowds, and the sections “Using machine models for scalable forecasting” and “Aggregate performance” in Hybrid forecasting of geopolitical events), but the machine models we were using were very weak on their own (depending on how I analyzed things, they were outperformed by guessing). In “System Complements the Crowd”, you take a linear average of the full crowd aggregate and the machine model, but we found that treating the machine as an exceptionally skilled forecaster resulted in the best performance of the overall system. Under this method, the machine forecast is down-weighted in the aggregate as more humans forecast on the question, which we found helped performance. You would need access to the individuated data of the forecasting platform to do this, however.
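To be explicit about the weighting scheme I mean, here is a minimal sketch (the weight of 5 is purely illustrative, not something we tuned):

```python
import numpy as np

def hybrid_aggregate(human_probs, machine_prob, machine_weight=5.0):
    """Weighted mean in which the machine counts like `machine_weight` skilled
    forecasters, so its influence shrinks as more humans forecast."""
    human_probs = np.asarray(human_probs, dtype=float)
    weighted_sum = human_probs.sum() + machine_weight * machine_prob
    return weighted_sum / (len(human_probs) + machine_weight)

print(hybrid_aggregate([0.6, 0.7], 0.2))       # few humans: machine pulls it to ~0.33
print(hybrid_aggregate([0.6, 0.7] * 50, 0.2))  # many humans: crowd dominates at ~0.63
```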
If you’re looking for additional useful plots, you could look at Human Forecast (probability) vs AI Forecast (probability) on a question-by-question basis and get a sense of how the humans and AI agree and disagree. For example, is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration? This would be harder to plot for multinomial questions, although there you could plot the probability assigned to the correct response option as long as the question isn’t ordinal.
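For the binary questions this could be as simple as a per-question scatter plot; something like the following, where the column names are just placeholders for however you store the forecasts:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical file with one row per question: crowd forecast, LM forecast, resolution.
df = pd.read_csv("forecasts.csv")

plt.scatter(df["human_prob"], df["lm_prob"], c=df["resolution"], cmap="coolwarm", alpha=0.5)
plt.plot([0, 1], [0, 1], "k--", linewidth=1)  # points on this line are exact agreement
plt.xlabel("Human crowd forecast")
plt.ylabel("LM forecast")
plt.title("Per-question human vs. LM probabilities (color = resolution)")
plt.show()
```

Points far off the diagonal are directional disagreements; points near the diagonal but pulled toward 0.5 on one axis would suggest a calibration/extremization difference instead.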
I see that you only answered binary questions and that you split multinomial questions. How did you do this? I suspect you did this by rephrasing questions of the form “What will $person do on $date, A, B, C, D, E, or F?” into “Will $person do A on $date?”, “Will $person do B on $date?”, and so on. This will result in a lot of very low probability forecasts, since it’s likely that only A or B occurs, especially closer to the resolution date. Also, does your system obey the Law of total probability (i.e. does it assign exactly 100% probability to the union of A, B, C, D, E, and F)? This might be a way to improve the performance of the system and coax your model into giving extreme forecasts that are grounded in reality (simply normalizing across the different splits of the multinomial question would probably work pretty well).
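The normalization I have in mind is just rescaling the splits of one question so they sum to 1 (a sketch, assuming the options are mutually exclusive and exhaustive):

```python
def normalize_splits(probs):
    """Rescale the binary forecasts for one multinomial question so they
    obey the law of total probability (i.e., sum to 1)."""
    total = sum(probs)
    return [p / total for p in probs]

# Forecasts for "Will $person do A?", ..., "Will $person do F?" summing to 1.55:
print(normalize_splits([0.55, 0.45, 0.25, 0.15, 0.10, 0.05]))
# -> roughly [0.35, 0.29, 0.16, 0.10, 0.06, 0.03]
```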
Why do human and LM forecasts differ? You plot calibration, and the human and LM forecasts are both well calibrated for the most part, but with your focus on system performance I’m left wondering what caused the human and LM forecasts to differ in accuracy. You claim that it’s because of a lack of extremization on the part of the LM forecast (i.e. that it gives too many 30-70% forecasts, while humans give more extreme forecasts), but is that an issue of calibration? You seem to say that it isn’t, but in that case the problem isn’t that the model is outputting the wrong forecast given what it knows (i.e. that it “hedge[s] predictions due to its safety training”), but rather that it is giving its best account of the probability given what it knows. The problem with e.g. the McCarthy question (example output #1) seems to me to be that the system does not understand the passage of time, and so it has no sense that because it has information from November 30th and it’s being asked a question about what happens on November 30th, it can answer with confidence. This is a failure in reasoning, not calibration, IMO. It’s possible I’m misunderstanding what cutoff is being used for example output #1.
Miscellaneous question: In equation 1, is k 0-indexed or 1-indexed?
I’m curious how a more sophisticated human-machine hybrid would perform with these much stronger machine models
I think human forecasters collaborating with their AI counterparts (in an assistance / debate setup) is a super interesting future direction. I imagine the strongest possible system we can build today will be of this sort. This related work explored this direction with some positive results.
is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration?
Definitely both, but more of it comes from the fact that the models don’t like to output extreme values (e.g. <5%), even when the evidence supports them. This doesn’t necessarily hurt calibration, though, since calibration only cares about the error within each bin of the predicted probabilities.
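To illustrate what we mean with a toy example (not our evaluation code): a forecaster who never leaves the 30-70% range can still have near-zero calibration error, as long as events in each bin occur at roughly the bin’s rate.

```python
import numpy as np

def calibration_error(probs, outcomes, bins=10):
    """Weighted mean gap between forecast and empirical frequency within each bin."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes, dtype=float)
    edges = np.linspace(0, 1, bins + 1)
    gaps, weights = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            gaps.append(abs(probs[mask].mean() - outcomes[mask].mean()))
            weights.append(mask.mean())
    return np.average(gaps, weights=weights)

rng = np.random.default_rng(0)
p = rng.uniform(0.3, 0.7, 10_000)   # hedged forecasts only, never extreme
y = (rng.random(10_000) < p)        # outcomes occur at exactly the forecast rates
print(calibration_error(p, y))      # close to 0 despite the lack of extreme forecasts
```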
This will result in a lot of very low probability forecasts, since it’s likely that only A or B occurs, especially closer to the resolution date.
Yes, though we didn’t do all of the multiple-choice questions, only those that were already split into binary questions by the platforms. For example, if you query the Metaculus API, some multiple-choice questions are broken down into binary subquestions (each with its own community predictions etc.). Our dataset is not dominated by such multiple-choice-turned-binary questions.
Does your system obey the Law of total probability?
No, and we didn’t try very hard to fix this. Similarly, if you ask the model the same binary question phrased in the opposite direction, the two answers in general do not sum to 1. I think future systems should try to overcome this issue by enforcing such constraints in some way.
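One cheap version of such a constraint, sketched here as an illustration rather than something we implemented: query both phrasings and reconcile them.

```python
def reconcile(p_event, p_negation):
    """Combine a forecast for "Will X happen?" (p_event) with one for
    "Will X not happen?" (p_negation) so the pair respects p + (1 - p) = 1."""
    # Treat p_event and 1 - p_negation as two noisy estimates of the same quantity.
    return (p_event + (1.0 - p_negation)) / 2.0

print(reconcile(0.60, 0.30))  # model said 60% yes and 30% no -> reconciled to 0.65
```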
I’m left wondering what caused the human and LM forecasts to differ in accuracy.
By accuracy, we mean 0-1 error: round the probabilistic forecast to whichever of 0 or 1 is nearer, then measure the 0-1 loss. This means that as long as you are directionally correct, you will have good accuracy. (This is not a standard metric, but we chose to report it mostly to compare with prior work.) So this kind of hedging behavior doesn’t hurt accuracy, in general.
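Concretely, the metric is just (a small sketch):

```python
def zero_one_accuracy(probs, outcomes):
    # Round each forecast to the nearest of 0/1 and score directional agreement.
    return sum(round(p) == y for p, y in zip(probs, outcomes)) / len(probs)

# A hedged 60% and a confident 95% both count as fully correct when the event happens.
print(zero_one_accuracy([0.60, 0.95], [1, 1]))  # -> 1.0
```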
The McCarthy example [...] This is a failure in reasoning, not calibration, IMO.
This is a good point! We’ll add a bit more on how to interpret these qualitative examples. To be fair, these are hand-picked and I would caution against drawing strong conclusions from them.
I expect your model would look relatively better under other proper scoring rules.
We will be updating the paper with log scores.
In equation 1, is k 0-indexed or 1-indexed?
1-indexed.