Are “superforecasters” a real phenomenon?
In https://slatestarcodex.com/2016/02/04/book-review-superforecasting/, Scott writes:
…okay, now we’re getting to a part I don’t understand. When I read Tetlock’s paper, all he says is that he took the top sixty forecasters, declared them superforecasters, and then studied them intensively. That’s fine; I’d love to know what puts someone in the top 2% of forecasters. But it’s important not to phrase this as “Philip Tetlock discovered that 2% of people are superforecasters”. This suggests a discontinuity, a natural division into two groups. But unless I’m missing something, there’s no evidence for this. Two percent of forecasters were in the top two percent. Then Tetlock named them “superforecasters”. We can discuss what skills help people make it this high, but we probably shouldn’t think of it as a specific phenomenon.
But in this article https://www.vox.com/future-perfect/2020/1/7/21051910/predictions-trump-brexit-recession-2019-2020, Kelsey Piper and Dylan Matthews write:
Tetlock and his collaborators have run studies involving tens of thousands of participants and have discovered that prediction follows a power law distribution. That is, most people are pretty bad at it, but a few (Tetlock, in a Gladwellian twist, calls them “superforecasters”) appear to be systematically better than most at predicting world events.
seeming to disagree. I’m curious who’s right.
So there’s the question “is ‘superforecaster’ a natural category?”, and I’m operationalizing that as “do the performances of GJP participants follow a power-law distribution, such that the best 2% are significantly better than the rest?”
Does anyone know the answer to that question? (And/or does anyone want to argue with that operationalization?)
The clear answer to the question posed, “do the performances of GJP participants follow a power-law distribution, such that the best 2% are significantly better than the rest”, is yes—with a minor quibble and a huge caveat. (Epistemic status: I’m very familiar with the literature, have personal experience as a superforecaster since the beginning, have had discussions with Dan Gardner and the people running the project, have had conversations with the heads of Good Judgement Inc, etc.)
The minor quibble is identified in other comments: it is unlikely that there is a sharp cutoff at 2%, there isn’t a discontinuity, and power law is probably the wrong term. Aside from those “minor” issues, yes, there is a clear group of people who outperformed multiple years in a row, and this group was fairly consistent from year to year. Not only that, but the ordering within that group is far more stable than chance would predict. That clearly validates the claim that “superforecasters are a real thing.”
But the data that those people are better is based on a number of things, many of which aren’t what you would think. First, the biggest difference between top forecasters and the rest is frequency of updates and a corresponding willingness to change their minds as evidence comes in. People who invest time in trying to forecast well do better than those who don’t—to that extent, it’s a skill like most others. Second, success at forecasting is predicted by most of the things that predict success at almost everything else—intelligence, time spent, and looking for ways to improve. Some of the techniques that Good Judgement advocates for superforecasters came from participants who read Kahneman and Tversky, Tetlock, and related research, and tried to apply the ideas. The things that worked were adopted—but not everything helped. Other techniques were original to the participants—for instance, explicitly comparing your estimate for a question across different timeframes, to ensure it is a coherent and reasonable probability. (Will X happen in the next 4 months? If we changed that to one month, would my estimate be about a quarter as high? What about if it were a year? If my intuition for the answer is about the same, I need to fix that.) Ideas like this are not natural ability; they are just applying intelligence to a problem they care about.
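To make that cross-timeframe check concrete, here is a minimal sketch (my own construction, not anything from GJP materials); it assumes the event arrives at a roughly constant monthly hazard rate, which is itself just a modeling choice:

```python
# Hypothetical illustration of the cross-timeframe coherence check described above.
# Assumption (mine, not from the comment): the event arrives at a roughly constant
# monthly hazard, so P(event within t months) = 1 - (1 - h)**t.

def implied_probability(p_known: float, t_known: float, t_other: float) -> float:
    """Given P(event within t_known months), return the probability implied
    for a t_other-month horizon under a constant-hazard assumption."""
    monthly_survival = (1 - p_known) ** (1 / t_known)  # chance of "no event" in any one month
    return 1 - monthly_survival ** t_other

# Suppose my gut says 30% for "X happens in the next 4 months".
p_4mo = 0.30
print(implied_probability(p_4mo, 4, 1))   # ~0.085 -- roughly a quarter, not the same 30%
print(implied_probability(p_4mo, 4, 12))  # ~0.66  -- much higher over a year

# If my intuition gives ~30% for all three horizons, at least two of those
# numbers are incoherent and need fixing.
```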
Also, many of the poorer performers were people who didn’t continue forecasting, and their initial numbers got stale—they presumably would have updated. The best performers, on the other hand, checked the news frequently, and updated. At times, we would change a forecast once the event had / had not happened, a couple days before the question was closed, yielding a reasonably large “improvement” in our time-weighted score. This isn’t a function of being naturally better—it’s just the investment of time that helps. (This also explains a decent part of why weighting recency in aggregate scores is helpful—it removes stale forecasts.)
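As a rough sketch of that mechanism (using the two-term Brier score and made-up numbers, not GJP’s actual scoring code): each day’s standing forecast gets scored, so the days after you jump to near-certainty pull the average down.

```python
# Toy illustration of why late updates improve a time-weighted Brier score.
# Numbers and scoring details are my own simplification, not GJP's pipeline.

def daily_brier(p: float, outcome: int) -> float:
    """Two-term Brier score for a binary question (0 = perfect, 2 = worst)."""
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

def time_weighted_brier(daily_forecasts: list[float], outcome: int) -> float:
    """Average the daily Brier scores over the life of the question."""
    return sum(daily_brier(p, outcome) for p in daily_forecasts) / len(daily_forecasts)

outcome = 1  # the event happened
stale   = [0.6] * 30                # forecast left untouched for 30 days
updated = [0.6] * 27 + [0.99] * 3   # same forecast, bumped to 99% once the outcome
                                    # became obvious, 3 days before the question closed

print(time_weighted_brier(stale, outcome))    # 0.32
print(time_weighted_brier(updated, outcome))  # ~0.29 -- "better", mostly from diligence
```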
So in short, I’m unconvinced that superforecasters are a “real” thing, except in the sense that most people don’t try, and people who do try will do better and improve over time. Given that, however, we absolutely should rely on superforecasters to make better predictions than the rest of us—as long as they continue doing the things that make them good forecasters.
I think one really important decision-relevant question is:
My impression is that the Good Judgement Project used several tests to try to identify strong forecasters in advance, but the tests didn’t predict who the superforecasters would be as well as some might have hoped.
Do you think that almost all of this can be explained either by:
- Diligence to the questions, similar to your example of the MMORPG?
- Other simple things that we may be able to figure out in the next few years?
If so, I imagine the value of being a “superforecaster” would go down a bit, but the value of being “a superforecaster in expectation” would go up.
Yes—I suspect a large amount of the variance is explained by features we can measure, and the residual may be currently unexplained, but filtering on the features you can measure probably gets most of what is needed.
However, I don’t think the conclusion necessarily follows.
The problem is a causal reasoning / incentive issue: just because people who update frequently do well doesn’t mean that telling people you’ll pay them for updating frequently will cause them to do better once they update more often. For instance, if you took MMORPG players and gave them money on condition that they spend it on the game, you’d screw up the relationship between spending and success.
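Here’s a toy model of that confounding worry (entirely my construction): “motivation” drives both how often someone updates and how accurate they end up being, so subsidizing updates wouldn’t buy you accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy confounding model: motivation causes both frequent updating and accuracy.
n = 10_000
motivation = rng.normal(size=n)
updates    = motivation + rng.normal(scale=0.5, size=n)   # motivated people update more...
accuracy   = motivation + rng.normal(scale=0.5, size=n)   # ...and are also more accurate

print(np.corrcoef(updates, accuracy)[0, 1])   # ~0.8: frequent updating predicts accuracy

# But in this model accuracy is caused by motivation, not by the updates
# themselves, so paying people to update more (shifting `updates` upward)
# would leave `accuracy` exactly as it is.
```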
Fair point. I’m sure you expect some correlation between the use of reasonable incentive structures and useful updating though. It may not be perfect, but I’d be surprised if it were 0.
Agreed.
Thanks for your reply!
It looks to me like we might be thinking about different questions. Basically I’m just concerned about the sentence “Philip Tetlock discovered that 2% of people are superforecasters.” When I read this sentence, it reads to me like “2% of people are superheroes” — they have performance that is way better than the rest of the population on these tasks. If you graphed the “jump height” of the population and 2% of the population were Superman, there would be a clear discontinuity at the higher end. That’s what I imagine when I read the sentence, and that’s what I’m trying to get at above.
It looks like you’re saying that this isn’t true?
(It looks to me like you’re discussing the question of how innate “superforecasting” is. To continue the analogy, whether superforecasters have innate powers like Superman or are just normal humans who train hard like Batman. But I think this is orthogonal to what I’m talking about. I know the sentence “are superforecasters a ‘real’ phenomenon” has multiple operationalizations, which is why I specified one as what I was talking about.)
But note that the section you quote from Vox doesn’t say that there’s any discontinuity:
A power law distribution is not a discontinuity! Some people are way way better than others. Other people are merely way better than others. And still others are only better than others.
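For instance, here’s a quick toy simulation (my own numbers, not GJP data) of a power-law-distributed quantity: the top 2% sit several times above the median, yet the sorted values climb smoothly through the 2% boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Power-law-ish sample: heavy tail, but no break anywhere in the distribution.
ability = rng.pareto(a=2.0, size=100_000) + 1   # classical Pareto "ability" scores
ability.sort()

idx = int(0.98 * ability.size)                  # index of the top-2% threshold
print(ability[idx] / np.median(ability))        # ~5x the median at that threshold
print(ability[idx] - ability[idx - 1])          # ~0.002: no jump at the boundary
```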
I think the sentence is misleading (as per Scott Alexander). A better sentence should give the impression that, by way of analogy, some basketball players are NBA players. They may seem superhuman in their basketball ability compared to the Average Joe. And there are a combination of innate traits as well as honed skills that got them there. These would be interesting to study if you wanted to know how to play basketball well. Or if you were putting together a team to play against the Monstars.
But there’s no discontinuity. Going down the curve from NBA players, you get to professional players in other leagues, and then to division 1 college players, and then division 2, etc. Somewhere after bench warmer on their high school basketball team, you get to Average Joe.
So SSC and Vox are both right. Some people are way way better than others (with a power law-like distribution), but there’s no discontinuity.
This analogy seems like a good way of explaining it. Saying (about forecasting ability) that some people are superforecasters is similar to saying (about basketball ability) that some people are NBA players or saying (about chess ability) that some people are Grandmasters. If you understand in detail the meaning of any one of these claims (or a similar claim about another domain besides forecasting/basketball/chess), then most of what you could say about that claim would port over pretty straightforwardly to the other claims.
(I’ll back off the Superman analogy; I think it’s disanalogous b/c of the discontinuity thing you point out.)
Yeah, I like the analogy “some basketball players are NBA players.” It makes it sound totally unsurprising, which it is.
I don’t agree that Vox is right, because:
- I can’t find any evidence for the claim that forecasting ability is power-law distributed, and it’s not clear what that would mean with Brier scores (as Unnamed points out).
- Their use of the term “discovered.”
I don’t think I’m just quibbling over semantics; I definitely had the wrong idea about superforecasters prior to thinking it through, it seems like Vox might have it too, and I’m concerned others who read the article will get the wrong idea as well.
From participating on Metaculus I certainly don’t get the sense that there are people who make uncannily good predictions. If you compare the community prediction to the Metaculus prediction, it looks like there’s a 0.14 difference in average log score, which I guess means a combination of the best predictors tends to put e^(0.14) or 1.15 times as much probability on the correct answer as the time-weighted community median. (The postdiction is better, but I guess subject to overfitting?) That’s substantial, but presumably the combination of the best predictors is better than every individual predictor. The Metaculus prediction also seems to be doing a lot worse than the community prediction on recent questions, so I don’t know what to make of that. I suspect that, while some people are obviously better at forecasting than others, the word “superforecasters” has no content outside of “the best forecasters” and is just there to make the field of research sound more exciting.
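For reference, the arithmetic behind that “1.15 times as much probability” reading (this assumes natural-log scoring, which is my guess at how Metaculus reports it):

```python
import math

# A gap in average log score translates to a multiplicative ratio of the
# probability placed on the correct answer (under natural-log scoring).
log_score_gap = 0.14
print(math.exp(log_score_gap))   # ~1.15
```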
Agreed. As I said, “it is unlikely that there is a sharp cutoff at 2%, there isn’t a discontinuity, and power law is probably the wrong term.”
As you concluded in other comments, this is wrong. But there doesn’t need to be a sharp cutoff for there to be “way better” performance. If the top 1% consistently have Brier scores of 0.01 on a class of questions, the next 1% have Brier scores of 0.02, and so on, you’d see “way better” performance without a sharp cutoff—and we’d see that the median Brier score of 0.5, exactly as good as always guessing 50-50, is WAY worse than the people at the top. (Let’s assume everyone else is at least as good as guessing 50-50, so the bottom half are all equally useless.)
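To make that concrete, here’s a quick toy construction of exactly those numbers (mine, purely illustrative), showing a big quality gap with no discontinuity anywhere:

```python
import numpy as np

# 1,000 hypothetical forecasters: the top 1% score 0.01, the next 1% score 0.02,
# and so on down to 0.50 at the 50th percentile; the bottom half all sit at 0.5.
top_half    = np.concatenate([np.full(10, 0.01 * k) for k in range(1, 51)])
bottom_half = np.full(500, 0.5)
scores      = np.sort(np.concatenate([top_half, bottom_half]))

print(np.diff(scores).max())                   # 0.01: no jump anywhere, no discontinuity
print(scores[:20].mean(), np.median(scores))   # ~0.015 vs 0.5: yet the top is far better
```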
If there isn’t a discontinuity, then how is there a clear group that outperformed?
See: https://www.lesswrong.com/posts/uoyn67q3HtB2ns2Yg/are-superforecasters-a-real-phenomenon#e9uGgK7PinFgK2o2z
In that case I’d consider this more like superforecasting as a continuum rather than a category, and the 2% cutoff seems quite arbitrary, as does calling them superforecasters.
That makes sense as an approach—but as mentioned initially, I think the issue with calling people superforecasters is deeper, since it’s unclear how much of the performance is even about their skill, rather than other factors.
Instead of basketball and the NBA, I’d compare superforecasting to performance at a modern (i.e. pay-to-win) mobile MMORPG: you need to be good to perform near the top, but the other factor that separates winners and losers is being willing to invest much more than others in loot boxes and items (i.e. time spent forecasting) because you really want to win.
In this analysis, is there any assumption about information states? Is the idea that the forecasts are all based on public information everyone has available? Or could differential access to information explain part of the performance gap, in which case we would need to look at a subset with comparable (or better) access and see how they perform against one another, or consider various types of informational asymmetries and institutional factors related to the information.
Superforecasters used only public information, or information they happened to have access to—but the original project was run in parallel with a (then secret) prediction platform for inside the intelligence community. It turned out that the intelligence people were significantly outperformed by superforecasters, despite having access to classified information and commercial information sources, so it seems clear that the information access wasn’t particularly critical for the specific class of geopolitical predictions they looked at. This is probably very domain dependent, however.
Thanks. Interesting, though not too surprising in some ways.
I was going off absence of evidence (the paper didn’t say anything other than that they took the top 2%), so positive evidence from anyone else would outweigh what I’m saying.
See my response below—and the dataset of forecasts is now public if you wanted to check the numbers.
I don’t see much disagreement between the two sources. The Vox article doesn’t claim that there is much reason for selecting the top 2% rather than the top 1% or the top 4% or whatever. And the SSC article doesn’t deny that the people who scored in the top 2% (and are thereby labeled “Superforecasters”) systematically do better than most at forecasting.
I’m puzzled by the use of the term “power law distribution”. I think that the GJP measured forecasting performance using Brier scores, and Brier scores are always between 0 and 1, which is the wrong shape for a fat-tailed distribution. And the next sentence (which begins “that is”) isn’t describing anything specific to power law distributions. So probably the Vox article is just misusing the term.
Hmm, thanks for pointing that out about Brier scores. The Vox article cites https://www.vox.com/2015/8/20/9179657/tetlock-forecasting for its “power law” claim, but that piece says nothing about power laws. It does have a graph which depicts a wide gap between “superforecasters” and “top-team individuals” in years 2 and 3 of the project, and not in year 1. But my understanding is that this is because the superforecasters were put together on elite teams after the first year, so I think the graph is a bit misleading.
(Citation: the paper https://stanford.edu/~knutson/nfc/mellers15.pdf)
I do think there’s disagreement between the sources. When I read sentences like the one quoted from the Vox article above, I definitely imagine looking at a graph of everyone’s performance on the predictions and noticing a cluster who are discontinuously much better than everyone else. I would be surprised if the authors of the piece didn’t imagine this as well. The article they link to does exactly what Scott warns against, saying “Tetlock’s team found out that some people were ‘superforecasters’.”
Some evidence against this is that they described it as being a “power law” distribution, which is continuous and doesn’t have these kinds of clusters. (It just goes way way up as you move to the right.)
If you had a power law distribution, it would still be accurate to say that “a few are better than most”, even though there isn’t a discontinuous break anywhere.
EDIT: It seems to me that most things like this follow approximately continuous distributions. And so whenever you hear someone talking about something like this you should assume it’s continuous unless it’s super clear that it’s not (and that should be a surprising fact in need of explanation!). But note that people will often talk about it in misleading ways, because for the sake of discussion it’s often simpler to talk about it as if there are these discrete groups. So just because people are talking about it as if there are discrete groups does not mean they actually think there are discrete groups. I think that’s what happened here.
Just observing that the answer to the question “Is there a substantial discontinuity at the 2% quantile?” should be more or less obvious from a histogram (assuming large enough N and a sufficient number of buckets).
Power law behaviour is not necessary, and arguably not sufficient, for “superforecasters are a natural category” to hold (e.g. the claim should hold in a population in which 2% have a Brier score of zero and the rest have a Brier score of 1, which is not a power law).
Agree re: power law.
The data is here https://dataverse.harvard.edu/dataverse/gjp?q=&types=files&sort=dateSort&order=desc&page=1 , so I could just find out. I posted here trying to save time, hoping someone else would already have done the analysis.
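In case it saves someone else time, here’s roughly the check I have in mind; the file and column names are placeholders, since I haven’t looked at the actual layout of the Dataverse files:

```python
import pandas as pd

# Placeholder schema ("forecaster_scores.csv", "user_id", "brier") -- treat this
# as pseudocode against a hypothetical per-forecast export, not the real files.
scores   = pd.read_csv("forecaster_scores.csv")
per_user = scores.groupby("user_id")["brier"].mean().sort_values()

cutoff = per_user.quantile(0.02)   # boundary of the best 2% (lower Brier = better)
gap    = per_user[per_user > cutoff].min() - per_user[per_user <= cutoff].max()
print(per_user.describe())
print("gap at the 2% boundary:", gap)
# per_user.hist(bins=100) would show whether there's a visible break near that
# quantile or just a smooth tail.
```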
By definition, the top 2% are always better than the other 98%.
Pre-selected so-called superforecasters were one of the groups asked to anticipate the outcome of the UK’s referendum on membership of the European Union, and they gave a 23% chance of a vote against membership succeeding—the lowest figure and worst performance of all the groups asked to predict the result. For a forecasting system to be reliable, it has to be assessed for its skill over a period of continuous forecasting that will tell you what percentage of its forecasts are correct. To be useful, this figure should be about 70% or more. The only forecasting system that exceeds this level of skill is the weather forecasting system. Superforecasting does not exist.
Huh, do you have a link for this claim? In particular the sentence “the lowest % and worst performance of all groups asked to predict the result”?