How deferential should we be to the forecasts of subject matter experts?
This post explores the question: how strongly should we defer to predictions and forecasts made by people with domain expertise? I’ll assume that the domain expertise is legitimate, i.e., the people with domain expertise really do have a lot of information in their minds that non-experts don’t. The information is usually not secret, and non-experts can usually access it through books, journals, and the Internet. But experts have more information inside their heads, and may understand it better. How big an advantage does this give them in forecasting?
Tetlock and expert political judgment
In an earlier post on historical evaluations of forecasting, I discussed Philip E. Tetlock’s findings on expert political judgment and forecasting skill, and summarized his own article for Cato Unbound co-authored with Dan Gardner that in turn summarized the themes of the book:
The average expert’s forecasts were revealed to be only slightly more accurate than random guessing—or, to put it more harshly, only a bit better than the proverbial dart-throwing chimpanzee. And the average expert performed slightly worse than a still more mindless competition: simple extrapolation algorithms that automatically predicted more of the same.
The experts could be divided roughly into two overlapping yet statistically distinguishable groups. One group (the hedgehogs) would actually have been beaten rather soundly even by the chimp, not to mention the more formidable extrapolation algorithm. The other (the foxes) would have beaten the chimp and sometimes even the extrapolation algorithm, although not by a wide margin.
The hedgehogs tended to use one analytical tool in many different domains; they preferred keeping their analysis simple and elegant by minimizing “distractions.” These experts zeroed in on only essential information, and they were unusually confident—they were far more likely to say something is “certain” or “impossible.” In explaining their forecasts, they often built up a lot of intellectual momentum in favor of their preferred conclusions. For instance, they were more likely to say “moreover” than “however.”
The foxes used a wide assortment of analytical tools, sought out information from diverse sources, were comfortable with complexity and uncertainty, and were much less sure of themselves—they tended to talk in terms of possibilities and probabilities and were often happy to say “maybe.” In explaining their forecasts, they frequently shifted intellectual gears, sprinkling their speech with transition markers such as “although,” “but,” and “however.”
It’s unclear whether the performance of the best forecasters is the best that is in principle possible.
This widespread lack of curiosity—lack of interest in thinking about how we think about possible futures—is a phenomenon worthy of investigation in its own right.
Tetlock has since started The Good Judgment Project (website, Wikipedia), a political forecasting competition where anybody can participate, and which has earned a reputation for doing a much better job at prediction than anything else around. Participants are given a set of questions and can collect freely available online information (in some rounds, participants were given additional access to some proprietary data), which they then use to make predictions. The aggregate predictions are quite good. For more information, visit the website or see the references in the Wikipedia article. In particular, this Economist article and this Business Insider article are worth reading. (I discussed the GJP and other approaches to global political forecasting in this post).
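To make “aggregate predictions” a bit more concrete, here is a minimal sketch of one common baseline for pooling individual probability forecasts: averaging log-odds, optionally with extremization. This is purely illustrative and is not claimed to be the GJP’s actual aggregation method; the function name and the example numbers are made up.

```python
import math

def aggregate_probabilities(probs, extremize=1.0):
    """Pool individual probability forecasts by averaging log-odds.

    probs     : probabilities in (0, 1) from individual forecasters
    extremize : exponent > 1 pushes the pooled forecast away from 0.5
    """
    log_odds = [math.log(p / (1 - p)) for p in probs]
    mean_log_odds = sum(log_odds) / len(log_odds)
    # Convert the (optionally scaled) mean log-odds back to a probability.
    return 1 / (1 + math.exp(-extremize * mean_log_odds))

# Example: five forecasters give their probability that some event occurs.
forecasts = [0.60, 0.70, 0.55, 0.65, 0.72]
print(aggregate_probabilities(forecasts))                  # plain log-odds average
print(aggregate_probabilities(forecasts, extremize=2.0))   # extremized version
```

One common rationale for extremizing is that each forecaster sees only part of the available evidence, so a plain average of their probabilities tends to be underconfident.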
So at least in the case of politics, it seems that amateurs, armed with basic information plus the freedom to look around for more, can use “fox-like” approaches and do a better job of forecasting than political scientists. Note that experts still do better than ignorant non-experts who are denied access to information. But once you have basic knowledge and are equipped to hunt more down, the constraining factor does not seem to be expertise, but rather, the approach you use (fox-like versus hedgehog-like). This should not be taken as a claim that expertise is irrelevant or unnecessary to forecasting. Experts play an important role in expanding the scope of knowledge and methodology that people can draw on to make their predictions. But the experts themselves, as people, do not have a unique advantage when it comes to forecasting.
Tetlock’s research focused on politics. But the claim that the fox-hedgehog distinction turns out to be a better predictor of forecasting performance than the level of expertise is a general one. How true is this claim in domains other than politics? Domains such as climate science, economic growth, computing technology, or the arrival of artificial general intelligence?
Armstrong and Green again
J. Scott Armstrong is a leading figure in the forecasting community. Along with Kesten C. Green, he penned a critique of the forecasting exercises in climate science in 2007, with special focus on the IPCC reports. I discussed the critique at length in my post on the insularity critique of climate science. Here, I quote a part from the introduction of the critique that better explains the general prior that Armstrong and Green claim to be bringing to the table when they begin their evaluation. Of the points they make at the beginning, two bear directly on the deference we should give to expert judgment and expert consensus:
Unaided judgmental forecasts by experts have no value: This applies whether the opinions are expressed in words, spreadsheets, or mathematical models. It applies regardless of how much scientific evidence is possessed by the experts. Among the reasons for this are:
a) Complexity: People cannot assess complex relationships through unaided observations.
b) Coincidence: People confuse correlation with causation.
c) Feedback: People making judgmental predictions typically do not receive unambiguous feedback they can use to improve their forecasting.
d) Bias: People have difficulty in obtaining or using evidence that contradicts their initial beliefs. This problem is especially serious for people who view themselves as experts.

Agreement among experts is only weakly related to accuracy: This is especially true when the experts communicate with one another and when they work together to solve problems, as is the case with the IPCC process.
Armstrong and Green later elaborate on these claims, referencing Tetlock’s work. (Note that I have removed the parts of the section that involve direct discussion of climate-related forecasts, since the focus here is on the general question of how much deference to show to expert consensus).
Many public policy decisions are based on forecasts by experts. Research on persuasion has shown that people have substantial faith in the value of such forecasts. Faith increases when experts agree with one another. Our concern here is with what we refer to as unaided expert judgments. In such cases, experts may have access to empirical studies and other information, but they use their knowledge to make predictions without the aid of well-established forecasting principles. Thus, they could simply use the information to come up with judgmental forecasts. Alternatively, they could translate their beliefs into mathematical statements (or models) and use those to make forecasts.
Although they may seem convincing at the time, expert forecasts can make for humorous reading in retrospect. Cerf and Navasky’s (1998) book contains 310 pages of examples, such as Fermi Award-winning scientist John von Neumann’s 1956 prediction that “A few decades hence, energy may be free”. [...] The second author’s review of empirical research on this problem led him to develop the “Seer-sucker theory,” which can be stated as “No matter how much evidence exists that seers do not exist, seers will find suckers” (Armstrong 1980). The amount of expertise does not matter beyond a basic minimum level. There are exceptions to the Seer-sucker Theory: When experts get substantial well-summarized feedback about the accuracy of their forecasts and about the reasons why their forecasts were or were not accurate, they can improve their forecasting. This situation applies for short-term (up to five day) weather forecasts, but we are not aware of any such regime for long-term global climate forecasting. Even if there were such a regime, the feedback would trickle in over many years before it became useful for improving forecasting.
Research since 1980 has provided much more evidence that expert forecasts are of no value. In particular, Tetlock (2005) recruited 284 people whose professions included “commenting or offering advice on political and economic trends.” He asked them to forecast the probability that various situations would or would not occur, picking areas (geographic and substantive) within and outside their areas of expertise. By 2003, he had accumulated over 82,000 forecasts. The experts barely, if at all, outperformed non-experts, and neither group did well against simple rules. Comparative empirical studies have routinely concluded that judgmental forecasting by experts is the least accurate of the methods available to make forecasts. For example, Ascher (1978, p. 200), in his analysis of long-term forecasts of electricity consumption, found that to be the case.
Note that the claims that Armstrong and Green make are in relation to unaided expert judgment, i.e., expert judgment that is not aided by some form of assistance or feedback that promotes improved forecasting. (One can argue that expert judgment in climate science is not unaided, i.e., that the critique is misapplied to climate science, but whether that is the case is not the focus of my post.) While Tetlock’s suggestion is to be more fox-like, Armstrong and Green recommend the use of their own forecasting principles, as encoded in their full list of principles and described on their website.
A conflict of intuitions, and an attempt to resolve it
I have two conflicting intuitions here. I like to use the majority view among experts as a reasonable Bayesian prior to start with, one that I might then modify based on further study. The relevant question here is who the experts are. Do I defer to the views of domain experts, who may know little about the challenges of forecasting, or do I defer to the views of forecasting experts, who may know little of the domain but argue that domain experts who are not following good forecasting principles have no advantage over non-experts?
I think the following heuristics are reasonable starting points:
In cases where we have a historical track record of forecasts, we can use that to evaluate the experts and non-experts. For instance, I reviewed the track record of survey-based macroeconomic forecasts, drawing on a wealth of recorded data on macroeconomic forecasts by economists over the last few decades. (Unfortunately, these surveys did not include corresponding data on layperson opinion.) A minimal sketch of how such a track record might be scored appears after this list.
The faster the feedback from making a forecast to knowing whether it’s right, the more likely it is that experts would have learned how to make good forecasts.
The more central forecasting is to the overall goals of the domain, the more likely people are to get it right. For instance, forecasting is a key part of weather and climate science. But forecasting progress on mathematical problems has a negligible relation with doing mathematical research.
Ceteris paribus, if experts are clearly recording their forecasts and the reasons behind them, and systematically evaluating their performance on past forecasts, that should be taken as (weak) evidence in favor of the experts’ views being taken more seriously (even if we don’t have enough of a historical track record to properly calibrate forecast accuracy). However, if they simply make forecasts but then fail to review their past history of forecasts, this may be taken as being about as bad as not forecasting at all. And in cases where the forecasts were bold, failed miserably, and the errors were never acknowledged, this should be taken as being considerably worse than not forecasting at all.
A weak inside view of the nature of domain expertise can give some idea of whether expertise should generally translate to better forecasting skill. For instance, even a very weak understanding of physics will tell us that physicists are no better than anyone else at predicting whether a coin toss will yield heads or tails, even though the fate of the coin is determined by physics. Similarly, with the exception of economists who specialize in the study of macroeconomic indicators, one wouldn’t expect economists to be able to forecast macroeconomic indicators better than most moderately economically informed people.
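As a concrete companion to the first heuristic above, here is a minimal sketch of how a recorded track record of probability forecasts could be scored with the Brier score. The forecasters and all the numbers are hypothetical, purely for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes.

    forecasts : probabilities assigned to 'event happens'
    outcomes  : 0/1 values recording what actually happened
    Lower is better; always guessing 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical track records for an 'expert' and a 'non-expert' on the same ten questions.
expert_forecasts     = [0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.6, 0.8, 0.3, 0.7]
non_expert_forecasts = [0.6, 0.6, 0.5, 0.7, 0.4, 0.4, 0.5, 0.6, 0.5, 0.6]
outcomes             = [1,   1,   0,   1,   0,   0,   1,   1,   0,   1]

print("expert:    ", brier_score(expert_forecasts, outcomes))
print("non-expert:", brier_score(non_expert_forecasts, outcomes))
```

With a long enough record of this kind, one could compare experts, non-experts, and simple rules on the same questions, which is essentially what Tetlock’s studies did.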
Politicization?
My first thought was that the more politicized a field, the less reliable any forecasts coming out of it. I think there are obvious reasons for that view, but there are also countervailing considerations.
The main claimed danger of politicization is groupthink and lack of openness to evidence. It could even lead to suppression, misrepresentation, or fabrication of evidence. Quite often, however, we see these qualities in highly non-political fields. People believe that certain answers are the right ones. Their political identity or ego is not attached to it. They just have high confidence that that answer is correct, and when the evidence they have does not match up, they think there is a problem with the evidence.

Of course, if somebody does start challenging the mainstream view, and the issue is not quickly resolved either way, it can become politicized, with competing camps of people who hold the mainstream view and people who side with the challengers. Note, however, that the politicization has arguably reduced the aggregate amount of groupthink in the field. Now that there are two competing camps rather than one received wisdom, new people can examine evidence and better decide which camp is more on the side of truth. People in both camps, now that they are competing, may try to offer better evidence that could convince the undecideds or skeptics. So “politicization” might well improve the epistemic situation (I don’t doubt that the opposite happens quite often).

Examples of such politicization might be the replacement of geocentrism by heliocentrism, the replacement of creationism by evolution, and the replacement of Newtonian mechanics by relativity and/or quantum mechanics. In the first two cases, religious authorities pushed against the new idea, even though the old idea had not been a “politicized” tenet before the competing claims came along. In the case of Newtonian and quantum mechanics, the debate seems to have been largely intra-science, but quantum mechanics had its detractors, including Einstein, famous for the “God does not play dice” quip. (This post on Slate Star Codex is somewhat related).
The above considerations aren’t specific to forecasting, and they apply even to assertions that fall squarely within the domain of expertise and require no forecasting skill per se. The extent to which they apply to forecasting problems is unclear. It’s not obvious that most domains have any significant groupthink in favor of particular forecasts; in fact, in most domains, forecasts aren’t really made or publicly recorded at all. So concerns of groupthink in a non-politicized scenario may not apply to forecasting. Perhaps the problem is the opposite: forecasts are so unimportant in many domains that the forecasts offered by experts are almost completely random and hardly informed in a systematic way by their expert knowledge. Even in such situations, politicization can be helpful, insofar as it makes the issue more salient and might prompt individuals to give more attention to trying to figure out which side is right.
The case of forecasting AI progress
I’m still looking into the case of forecasting AI progress, but for now, I’d like to point people to Luke Muehlhauser’s excellent blog post from May 2013 discussing the difficulty of forecasting AI progress. Interestingly, he makes many points similar to those I make here. (Note: Although I had read the post around the time it was published, I hadn’t reread it until I finished drafting the rest of my current post. Nonetheless, my views can’t be considered totally independent of Luke’s because we’ve discussed my forecasting contract work for MIRI.)
Should we expect experts to be good at predicting AI, anyway? As Armstrong & Sotala (2012) point out, decades of research on expert performance suggest that predicting the first creation of AI is precisely the kind of task on which we should expect experts to show poor performance — e.g. because feedback is unavailable and the input stimuli are dynamic rather than static. Muehlhauser & Salamon (2013) add, “If you have a gut feeling about when AI will be created, it is probably wrong.”
[...]
On the other hand, Tetlock (2005) points out that, at least in his large longitudinal database of pundits’ predictions about politics, simple trend extrapolation is tough to beat. Consider one example from the field of AI: when David Levy asked 1989 World Computer Chess Championship participants when a chess program would defeat the human World Champion, their estimates tended to be inaccurately pessimistic, despite the fact that computer chess had shown regular and predictable progress for two decades by that time. Those who forecasted this event with naive trend extrapolation (e.g. Kurzweil 1990) got almost precisely the correct answer (1997).
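To make “naive trend extrapolation” concrete, here is a minimal sketch that fits a straight line to past performance figures and reads off when the trend crosses a target level. The years, ratings, and target below are made up for illustration; they are not Levy’s survey data or Kurzweil’s actual analysis.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

years   = [1970, 1975, 1980, 1985, 1989]   # hypothetical observation years
ratings = [1700, 1900, 2100, 2300, 2450]   # hypothetical best-program Elo ratings
target  = 2800                             # roughly world-champion strength

a, b = fit_line(years, ratings)
crossing_year = (target - a) / b
print("Trend crosses target around", round(crossing_year))
```

With these made-up numbers the crossing lands in the late 1990s, but the point is the method, not the particular output.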
Looking for thoughts
I’m particularly interested in thoughts from people on the following fronts:
What are some indicators you use to determine the reliability of forecasts by subject matter experts?
How do you resolve the conflict of intuitions between deferring to the views of domain experts and deferring to the conclusion that forecasters have drawn about the lack of utility of domain experts’ forecasts?
In particular, what do you think of the way that “politicization” affects the reliability of forecasts?
Also, how much value do you assign to agreement between experts when judging how much trust to place in expert forecasts?
Comments that elaborate on these questions or this general topic within the context of a specific domain or domains would also be welcome.
Thank you, I think this has direct applications for evaluating research in everyday life. Specifically, the most valuable role an expert can play is not synthesis of evidence (presenting a conclusion) but simply making sure that there is a correct overview of what evidence is available. I should increase my credence in experts who seem to be engaging in this behavior (collection and summary of evidence) and lower my credence in experts who engage in lots of synthesis. Likewise, I should also not bother synthesizing but devote effort towards finding the best evidence available, collating it, and then getting feedback from lots of others on what the best synthesis would be. Perhaps I should do things like rot13 my own preliminary conclusions and ask people to comment on the evidence before reading them.
The key is to see how much their experience in the subject matter has facilitated being an expert on forecasting the subject matter, or forecasting in general. A doctor may be expert in the mechanics of a disease process, but wholly incompetent in statistical inference about that disease process.
Good post. To which I would add...
There is much more to expertise than forecasting. There is also:
Designing and building “things” of various kinds that work
Fixing “things”
“Things” could include social systems, people, business structures, advertising campaigns, not just machines of course.
A person may be a very good football coach in the sense of putting together a team that wins, or fixing a losing team, but may not be too good at making predictions. Doctors are notoriously bad at predicting patient outcomes but are often very skilled at actually treating them.
I think to a degree you confuse assessing whether a group does have expertise with assessing whether they are *likely* to have expertise.
As far as factors that count against expertise being reliably or significantly present go, to your point about politics I would add:
1. Money. The medical literature is replete with studies showing huge effect sizes from “who paid the piper”. In pharmaceutical research this seems to result in roughly a 4X difference in the chance of a positive result. But there is more to it than this; the ability to offer speaking and consultancy fees, funding of future projects, etc. can have a powerful effect.
Another example is the problem alluded to in relation to the consensus about the historical Jesus. When a field is dominated by people whose livelihood depends on their continuing to espouse a certain belief, the effect goes beyond those individuals and infects the whole field.
2. The pernicious effects of “great men” who can suppress dissent against their out-of-date views. Is the field pluralistic, realistically, and is dissent allowed? Science advances funeral by funeral. Have a look at what happened to John (“Pure, White and Deadly”) Yudkin.
3. Politics beyond what we normally think of as politics. Academia is notoriously “political” in this wider sense. Amplifying your point about reality checks, if feedback is not accurate, rapid, and unambiguous, it is hard for people in the field to know who is right, if anyone.
4. “Publish or perish”. There are massive incentives to get published or to get publicity or a high profile. This leads to people claiming expertise, results, and achievements that are bogus. Consider for example the case of Theranos, which seemed, if media reports are accurate, to have no useful ability to build systems that did pathology tests, yet apparently hoodwinked many into thinking that it did.
You make a good point that claims of expertise without evidence or, worse, in the face of adverse evidence, are really really bad. I would go as far as to say that if you claim expertise but cannot prove it, I have a strong prior that you don’t have it.
There are large groups of self-described experts who do not have expertise or at best have far less than they think. One should be alert to the possibility that “experts” aren’t.
Also to reinforce a very important point: even when experts are not very expert, they are probably a lot better than you+google+30minutes!
I believe we should use analytics to find the commonalities in the opinions of groups of interacting experts:
http://arxiv.org/abs/1406.7578