Yes, the Google ‘search by date’ is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like “Xi Jinping” with date-ranges like 2013… It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is ‘forgetting’ old articles, which apparently aren’t being indexed at all in any publicly-searchable fashion (which might be contributing to the former, by a base-rates screening-paradox effect—if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I’m not aware of any search engine whose date cutoff is truly reliable. Even if one were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.
I also agree that if this is really the only countermeasure to data leakage the OP has taken, then the results seem dead on arrival. ML models are infamous for ruthlessly exploiting far subtler forms of temporal data leakage than this...
It sounds like I’ll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.
(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious that even with completely unedited, static snapshots from the past, there would be leakage—for example, results will rank higher or lower based on future events. If Israel attacked Iran, obviously all the articles from beforehand arguing that Israel will/should/could attack Iran are going to benefit from being ‘right’ and rank higher than articles arguing the opposite, many of which will quietly disappear outright & cease to be mentioned, and an LLM conditioned on the former rather than the lower-ranking ones will automatically & correctly ‘predict’ more accurately. And there are countless other leakages like that, which are not fixed as easily as “just download a snapshot from the IA”.)
EDIT: Metaculus discussion of date-range problems
Danny Halawi says there is lower performance on a different, more held-out set of predictions, and that the claims about the GPT-4 knowledge cutoff are probably wrong:
The results in “LLMs Are Superhuman Forecasters” don’t hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
Their Brier score: 0.195
Crowd Brier score: 0.141 [lower=better]
First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as Speaker of the House, an event that happened at the end of October and that also happens to be a question in the Metaculus dataset.
The results of the replication are so bad that I’d want to see somebody else review the methodology or try the same experiment or something before trusting that this is the “right” replication.
Manifold claims a Brier score of 0.17 and says it’s “very good”: https://manifold.markets/calibration
Prediction markets in general don’t score much better (https://calibration.city/accuracy). I wouldn’t say 0.195 is “so bad”.
Comparing Brier scores between different question sets is not meaningful (intuitive example: Manifold hurts its Brier score with every daily coinflip market, and greatly improves its Brier score with every d20 die-roll market, but both identically demonstrate zero predictive insight) [1]. You cannot call 0.195 good or bad or anything in between—the Brier score is only useful when comparing on a shared question set.
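To make that intuition concrete, here is a minimal sketch (the markets and probabilities are illustrative, not actual Manifold data): a perfectly calibrated but insight-free forecast scores 0.25 on coinflip questions and about 0.05 on d20 questions, so the question mix alone can move the number by a factor of five.

```python
# Illustrative numbers only (not real Manifold data): expected Brier scores for
# two perfectly calibrated, zero-insight forecasts on different question types.

def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2

# Daily coinflip market: forecast 50%, resolves Yes half the time.
coinflip = 0.5 * brier(0.5, 1) + 0.5 * brier(0.5, 0)            # = 0.25

# d20 market ("will it land on 20?"): forecast 5%, resolves Yes 1 time in 20.
d20 = (1 / 20) * brier(0.05, 1) + (19 / 20) * brier(0.05, 0)    # = 0.0475

print(f"coinflip market: {coinflip:.4f}")   # 0.2500
print(f"d20 market:      {d20:.4f}")        # 0.0475
# Neither forecast contains any insight (both are just the base rate), yet the
# scores differ by ~5x purely because of the question mix.
```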
The linked replication addresses this (same as the original paper)—the relevant comparison is the crowd Brier score of 0.141. For intuition, the gap between the crowd Metaculus Brier score of 0.141 & the AI’s 0.195 is roughly as large as the gap between 0.195 & 0.25 (the result if you guess 50% for all questions). So the claim of the replication is quite conclusive (the AI did far worse than the Metaculus crowd), the question is just whether that replication result is itself accurate.
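For the arithmetic behind that comparison, using only the numbers quoted above: guessing 50% on every question yields a Brier score of 0.25 regardless of outcomes, and the crowd-vs-AI gap is almost exactly the same size as the AI-vs-50/50 gap. A quick sanity check:

```python
# Numbers from the replication as quoted above; everything else is arithmetic.
crowd, ai = 0.141, 0.195

# Guessing 50% on every question scores 0.25 no matter how it resolves:
always_fifty = (0.5 - 1) ** 2          # == (0.5 - 0) ** 2 == 0.25

print(round(ai - crowd, 3))            # 0.054 -> gap between crowd and AI
print(round(always_fifty - ai, 3))     # 0.055 -> gap between AI and 50/50 guessing
```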
[1]. Yes, Manifold reports this number on their website, and says it is “very good”—as a Manifold addict I would strongly encourage them to not do this. When I place bets on an event that already happened (which is super common), the Brier score contribution from that bet is near zero, i.e. impossibly good. And if I make a market that stays near 50% (also super common, e.g. if I want to maximize liquidity return), all the bets on that market push the site-wide Brier score towards the maximally non-predictive 0.25.
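A small sketch of the two distortions described in this footnote (the bet mix is made up purely for illustration): near-certain bets on effectively resolved events contribute almost nothing to the average, markets parked at 50% contribute the maximally uninformative 0.25, and the site-wide figure mostly reflects the ratio between the two rather than predictive skill.

```python
def brier(forecast: float, outcome: int) -> float:
    return (forecast - outcome) ** 2

# A bet at 99% on an event that has effectively already resolved Yes:
resolved_bet = brier(0.99, 1)      # 0.0001 -- "impossibly good"

# A market parked at 50% (e.g. to farm liquidity), however it resolves:
parked_market = brier(0.50, 1)     # 0.25 -- maximally non-predictive

# A hypothetical 70/30 mix of such bets makes the site-wide average look great
# without reflecting any predictive skill at all:
mix = [resolved_bet] * 70 + [parked_market] * 30
print(round(sum(mix) / len(mix), 3))   # 0.075
```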
Slightly tangential, but do you know what the correct base rate of Manifold binary questions is? Like, is it closer to 30% or closer to 50% for questions that resolve Yes?