The clear answer to the question posed, “do the performances of GJP participants follow a power-law distribution, such that the best 2% are significantly better than the rest,” is yes—with a minor quibble, and a huge caveat. (Epistemic status: I’m very familiar with the literature, have personal experience as a superforecaster since the beginning, and have had discussions with Dan Gardner, the people running the project, and the heads of Good Judgement Inc, etc.)
The minor quibble, identified in other comments, is that it is unlikely that there is a sharp cutoff at 2%, there isn’t a discontinuity, and power law is probably the wrong term. Aside from those “minor” issues, yes, there is a clear group of people who outperformed multiple years in a row, and this group was fairly consistent from year to year. Not only that, but the order within that group is far more stable than chance. That clearly validates the claim that “superforecasters are a real thing.”
But the data showing that those people are better reflects a number of things, many of which aren’t what you would think. First, the biggest difference between top forecasters and the rest is frequency of updates and a corresponding willingness to change their minds as evidence comes in. People who invest time in trying to forecast well do better than those who don’t—to that extent, it’s a skill like most others. Second, success at forecasting is predicted by most of the things that predict success at almost everything else—intelligence, time spent, and looking for ways to improve. Some of the techniques that Good Judgement advocates for superforecasters came from people who read Kahneman and Tversky, Tetlock, and related research, and tried to apply the ideas. The things that worked were adopted—but not everything helped. Other techniques were original to the participants—for instance, explicitly comparing your estimates for a question over different timeframes, to ensure they form a coherent and reasonable set of probabilities. (Will X happen in the next 4 months? If we changed that to one month, would my estimate be about a quarter as high? What about if it were a year? If my intuition for the answer is about the same, I need to fix that.) Ideas like this are not natural ability; they are just the result of applying intelligence to a problem the forecasters care about.
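To make that timeframe check concrete, here is a minimal sketch of the arithmetic behind it. The constant-hazard assumption and the function name are mine, not anything GJP prescribes; the point is only that probabilities over nested horizons have to scale in a particular way, so an intuition that returns the same number for every horizon is incoherent.

```python
import math

def implied_probability(p_ref: float, t_ref: float, t_new: float) -> float:
    """Rescale an event probability from one time horizon to another,
    assuming a constant hazard rate (a simplifying assumption)."""
    hazard = -math.log(1.0 - p_ref) / t_ref      # per-unit-time hazard
    return 1.0 - math.exp(-hazard * t_new)

# Suppose my gut says "20% chance X happens in the next 4 months".
p_4_months = 0.20
print(round(implied_probability(p_4_months, 4, 1), 3))   # ~0.054, roughly a quarter of 20%
print(round(implied_probability(p_4_months, 4, 12), 3))  # ~0.488, far higher than 20%
# If my intuitive answers for 1 month and 12 months are also "about 20%",
# at least one of the three estimates needs fixing.
```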
Also, many of the poorer performers were people who didn’t continue forecasting, so their initial numbers got stale—had they kept participating, they presumably would have updated. The best performers, on the other hand, checked the news frequently and updated. At times, we would change a forecast once the event had / had not happened, a couple of days before the question closed, yielding a reasonably large “improvement” in our time-weighted score. This isn’t a function of being naturally better—it’s just the investment of time that helps. (This also explains a decent part of why weighting recency in aggregate scores is helpful—it removes stale forecasts.)
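As a toy illustration of that scoring dynamic (this uses a generic daily-averaged Brier score on a 0 to 1 scale, not GJP’s exact formula, and the numbers are invented):

```python
def time_weighted_brier(forecasts, outcome, horizon):
    """forecasts: chronological list of (day, probability) pairs. Each
    forecast is carried forward until the next update, and the score is
    the average daily Brier score over the question's lifetime."""
    total = 0.0
    for i, (day, p) in enumerate(forecasts):
        end = forecasts[i + 1][0] if i + 1 < len(forecasts) else horizon
        total += (p - outcome) ** 2 * (end - day)
    return total / horizon

# A 90-day question that resolves "yes" near the end. The stale forecaster
# enters 20% on day 0 and never returns; the active forecaster starts at
# the same 20%, raises it as news comes in, and jumps to 95% two days
# before close, once the outcome is effectively known.
stale  = [(0, 0.20)]
active = [(0, 0.20), (60, 0.50), (88, 0.95)]
print(time_weighted_brier(stale,  outcome=1, horizon=90))   # ~0.64
print(time_weighted_brier(active, outcome=1, horizon=90))   # ~0.50, lower is better
```

Most of the gap comes from simply showing up and refreshing the number, which is the point about time investment rather than innate ability.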
So in short, I’m unconvinced that superforecasters are a “real” thing, except in the sense that most people don’t try, and people who do will do better, and improve over time. Given that, however, we absolutely should rely on superforecasters to make better predictions than everyone else—as long as they continue doing the things that make them good forecasters.
I think one really important decision-relevant question is:
“Do we need to have forecasters spend years forecasting questions before we can get a good sense of how good they are, or can we get most of that information with a quick (<1 week) test?”
My impression is that the Good Judgement Project used several tests to try to identify good forecasters in advance, but the tests didn’t predict who would become a superforecaster as well as some may have desired.
Do you think that almost all of this can be explained either by:
- Diligence to the questions, similar to your example of the MMORPG?
- Other simple things that we may be able to figure out in the next few years?
If so, I imagine the value of being a “superforecaster” would go down a bit, but the value of being “a superforecaster in expectation” would go up.
Yes—I suspect a large amount of the variance is explained by features we can measure, and the residual may be currently unexplained, but filtering on the features you can measure probably gets most of what is needed.
However, I don’t think the conclusion necessarily follows.
The problem is a causal reasoning / incentive issue: just because people who update frequently do well doesn’t mean that telling people you’ll pay those who update frequently will cause them to do better once they update more often. For instance, if you took MMORPG players and gave them money on the condition that they spend it on the game, you’d screw up the relationship between spending and success.
Fair point. I’m sure you expect some correlation between the use of reasonable incentive structures and useful updating though. It may not be perfect, but I’d be surprised if it were 0.
Agreed.
Thanks for your reply!
It looks to me like we might be thinking about different questions. Basically I’m just concerned about the sentence “Philip Tetlock discovered that 2% of people are superforecasters.” When I read this sentence, it reads to me like “2% of people are superheroes” — they have performance that is way better than the rest of the population on these tasks. If you graphed “jump height” of the population and 2% of the population is Superman, there would be a clear discontinuity at the higher end. That’s what I imagine when I read the sentence, and that’s what I’m trying to get at above.
It looks like you’re saying that this isn’t true?
(It looks to me like you’re discussing the question of how innate “superforecasting” is. To continue the analogy, whether superforecasters have innate powers like Superman or are just normal humans who train hard like Batman. But I think this is orthogonal to what I’m talking about. I know the sentence “are superforecasters a ‘real’ phenomenon” has multiple operationalizations, which is why I specified one as what I was talking about.)
If you graphed “jump height” of the population and 2% of the population is Superman, there would be a clear discontinuity at the higher end.
But note that the section you quote from Vox doesn’t say that there’s any discontinuity:
Tetlock and his collaborators have run studies involving tens of thousands of participants and have discovered that prediction follows a power law distribution.
A power law distribution is not a discontinuity! Some people are way way better than others. Other people are merely way better than others. And still others are only better than others.
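As a quick numerical illustration of that point, here is a sketch that samples a Pareto distribution (a textbook power law) as a stand-in for “ability”; nothing here is forecasting data, it just shows a heavy tail with no jump in it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a heavy-tailed "ability" sample from a Pareto distribution.
ability = rng.pareto(a=1.5, size=100_000)

# The quantiles rise smoothly -- there is no jump at the 98th
# percentile -- yet the top 2% sit far above the median.
for q in (50, 90, 98, 99, 99.9):
    print(q, round(float(np.percentile(ability, q)), 1))
```

The tail is enormous relative to the middle of the distribution, but there is no point on the curve where you could say the “superforecasters” begin.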
“Philip Tetlock discovered that 2% of people are superforecasters.” When I read this sentence, it reads to me like “2% of people are superheroes”
I think the sentence is misleading (as per Scott Alexander). A better sentence should give the impression that, by way of analogy, some basketball players are NBA players. They may seem superhuman in their basketball ability compared to the Average Joe, and there’s a combination of innate traits and honed skills that got them there. These would be interesting to study if you wanted to know how to play basketball well. Or if you were putting together a team to play against the Monstars.
But there’s no discontinuity. Going down the curve from NBA players, you get to professional players in other leagues, and then to division 1 college players, and then division 2, etc. Somewhere after bench warmer on their high school basketball team, you get to Average Joe.
So SSC and Vox are both right. Some people are way way better than others (with a power law-like distribution), but there’s no discontinuity.
A better sentence should give the impression that, by way of analogy, some basketball players are NBA players.
This analogy seems like a good way of explaining it. Saying (about forecasting ability) that some people are superforecasters is similar to saying (about basketball ability) that some people are NBA players or saying (about chess ability) that some people are Grandmasters. If you understand in detail the meaning of any one of these claims (or a similar claim about another domain besides forecasting/basketball/chess), then most of what you could say about that claim would port over pretty straightforwardly to the other claims.
(I’ll back off the Superman analogy; I think it’s disanalogous b/c of the discontinuity thing you point out.)
Yeah, I like the analogy “some basketball players are NBA players.” It makes it sound totally unsurprising, which it is.
I don’t agree that Vox is right, because:
- I can’t find any evidence for the claim that forecasting ability is power-law distributed, and it’s not clear what that would mean with Brier scores (as Unnamed points out).
- Their use of the term “discovered.”
I don’t think I’m just quibbling over semantics; I definitely had the wrong idea about superforecasters prior to thinking it through, it seems like Vox might have it too, and I’m concerned others who read the article will get the wrong idea as well.
From participating on Metaculus I certainly don’t get the sense that there are people who make uncannily good predictions. If you compare the community prediction to the Metaculus prediction, it looks like there’s a 0.14 difference in average log score, which I guess means a combination of the best predictors tends to put e^(0.14) or 1.15 times as much probability on the correct answer as the time-weighted community median. (The postdiction is better, but I guess subject to overfitting?) That’s substantial, but presumably the combination of the best predictors is better than every individual predictor. The Metaculus prediction also seems to be doing a lot worse than the community prediction on recent questions, so I don’t know what to make of that. I suspect that, while some people are obviously better at forecasting than others, the word “superforecasters” has no content outside of “the best forecasters” and is just there to make the field of research sound more exciting.
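For what it’s worth, spelling out the arithmetic in that comparison (I’m assuming the track-record scores are natural-log scores; the 69% vs. 60% pair below is just an invented example of a ~0.14 gap):

```python
import math

def log_score(p_on_outcome: float) -> float:
    """Natural-log score: the log of the probability the forecast
    assigned to what actually happened (higher is better)."""
    return math.log(p_on_outcome)

# A 0.14 gap in average log score corresponds to putting about 1.15x as
# much probability on the correct answer, as a geometric mean:
print(round(math.exp(0.14), 2))                     # 1.15

# For example, 69% vs. 60% on a question that resolves "yes":
print(round(log_score(0.69) - log_score(0.60), 2))  # 0.14
```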
Agreed. As I said, “it is unlikely that there is a sharp cutoff at 2%, there isn’t a discontinuity, and power law is probably the wrong term.”
it reads to me like “2% of people are superheroes” — they have performance that is way better than the rest of the population on these tasks.
As you concluded in other comments, this is wrong. But there doesn’t need to be a sharp cutoff for there to be “way better” performance. If the top 1% consistently have Brier scores on a class of questions of 0.01, the next 1% have Brier scores of 0.02, and so on, you’d see “way better performance” without a sharp cutoff—and the median Brier score of 0.5, exactly as good as flipping a coin, would be WAY worse than the scores at the top. (Let’s assume everyone else is at least as good as flipping a coin, so the bottom half are all equally useless.)
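Here is that hypothetical written out as a toy curve (the numbers are the purely illustrative ones from the comment above, not real data):

```python
def brier_at_percentile(top_pct: int) -> float:
    """Hypothetical Brier score for the forecaster at the top `top_pct`
    percent: the best 1% score 0.01, the next 1% score 0.02, and so on,
    flattening out at 0.50 (random guessing) for the bottom half."""
    return min(top_pct, 50) / 100

for pct in (1, 2, 5, 10, 25, 50, 75):
    print(pct, brier_at_percentile(pct))
# No jump anywhere on this curve, yet the median score (0.5) is 25 times
# worse than the score at the 2% mark (0.02).
```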
If there isn’t a discontinuity, then how is there a clear group that outperformed?
See: https://www.lesswrong.com/posts/uoyn67q3HtB2ns2Yg/are-superforecasters-a-real-phenomenon#e9uGgK7PinFgK2o2z
I’d consider this something like Superforecasting as a continuum rather than a category in that case, and 2% seems quite arbitrary, as does calling them superforecasters.
That makes sense as an approach—but as mentioned initially, I think the issue with calling people superforecasters is deeper, since it’s unclear how much of the performance is even about their skill, rather than other factors.
Instead of basketball and the NBA, I’d compare superforecasting to performance at a modern (i.e. pay-to-win) mobile MMORPG: you need to be good to perform near the top, but the other factor that separates winners and losers is being willing to invest much more than others in loot boxes and items (i.e. time spent forecasting) because you really want to win.
In this analysis, is there any assumption about information states? Is the idea that the forecasts are all based on public information everyone has available to them? Or could differences in access explain part of the difference in performance, in which case we would need to look at a subset with better access to the information and see how they perform against one another—or consider various types of informational asymmetries or institutional factors related to the information?
Superforecasters used only public information, or information they happened to have access to—but the original project was run in parallel with a (then secret) prediction platform for inside the intelligence community. It turned out that the intelligence people were significantly outperformed by superforecasters, despite having access to classified information and commercial information sources, so it seems clear that the information access wasn’t particularly critical for the specific class of geopolitical predictions they looked at. This is probably very domain dependent, however.
Thanks. Interesting, though not too surprising in some ways.