Book review: The Book of Why, by Judea Pearl and Dana Mackenzie.
This book aims to turn the ideas from Pearl’s seminal Causality into something that’s readable by a fairly wide audience.
It is somewhat successful. Most of the book is pretty readable, but parts of it still read like they were written for mathematicians.
History of science
A fair amount of the book covers the era (most of the 20th century) when statisticians and scientists mostly rejected causality as an appropriate subject for science. They mostly observed correlations, and carefully repeated the mantra “correlation does not imply causation”.
Scientists kept wanting to at least hint at causal implications of their research, but statisticians rejected most attempts to make rigorous claims about causes.
The one exception was for randomized controlled trials (RCTs). Statisticians figured out early on that a good RCT can demonstrate that correlation does imply causation. So RCTs became increasingly important over much of the 20th century[1].
That created a weird tension, where the use of RCTs made it clear that scientists valued the concept of causality, but in most other contexts they tried to talk as if causality wasn’t real. Not quite as definitely unreal as phlogiston. A bit closer to how behaviorists often tabooed the idea that we have internal experiences and consciousness, or how linguists once banned debates on the origin of language: the sense was that it was dangerous to think science could touch those topics. Or maybe a bit like heaven and hell, concepts which, even if they are useful, seem to be forever beyond the reach of science?
But scientists kept wanting to influence the world, rather than just predict it. So when they couldn’t afford to wait for RCTs, they often got impatient and acted as if correlations told them something about causation.
The most conspicuous example is smoking. Scientists saw many hints that smoking caused cancer, but without an RCT[2], their standards and vocabulary made it hard to say more than that smoking is associated with cancer.
This eventually prompted experts to articulate criteria that seemed somewhat useful at establishing causality. But even in ideal circumstances, those criteria weren’t convincing enough to produce a consensus. Authoritative claims about smoking and cancer were delayed for years by scientists’ discomfort with talking about causality[3].
It took Pearl to describe how to formulate an unambiguous set of causal claims, and then say rigorous things about whether the evidence confirms or discredits those claims.
What went wrong?
The book presents some good hints about why the concept of causality was tabooed from science for much of the 20th century.
It focuses on the role of R.A. Fisher (also known as one of the main advocates of frequentism). Fisher was a zealot whose prestige was somewhat heavily based on his skill at quantifying uncertainty. In contrast, he didn’t manage to quantify causality, or even figure out how to talk clearly about it. Pearl hints that this biased him against causal reasoning.
As the book puts it: “path analysis requires scientific thinking, as does every exercise in causal inference. Statistics, as frequently practiced, discourages it, and encourages ‘canned’ procedures instead.”
But blaming a few influential people seems to merely describe the tip of the iceberg. Why did scientists as a group follow Fisher’s lead?
I suggest that the iceberg is better explained by what James C. Scott describes as high modernism and the desire for legibility.
I see a similar pattern in the 20th century dominance of frequentism in most fields of science and the rejection of Bayesian approaches. Anything that required priors (whose source often couldn’t be rigorously measured) was at odds with the goal of legibility.
The rise and fall of the taboo on causal inference coincide moderately well with the rise and fall of Soviet-style central planning, planned cities, and Taylorist factory management.
I also see some overlap with behaviorism, with its attempt to deny the importance of variables that were hard to measure, and its utopian hopes for how much its techniques could accomplish.
These patterns all seem to be rooted in overconfident extrapolations of simple models of what caused progress. I don’t think it’s an accident that they all peaked near the middle of the 20th century, and were mostly discredited by the end of the century.
I remember that when I was young, I supported the standard inferences from the “correlation does not imply causation” mantra, and was briefly (and less clearly) tempted by the other manifestations of high modernism. Alas, I don’t remember my reasons well enough for them to be of much use, other than a semi-appropriate respect for the authorities who were promoting those ideas.
An example of why causal reasoning matters
Here’s an example that the book provides, dealing with non-randomized studies of a fictitious drug (to illustrate Simpson’s Paradox, but also to show the difference between statistics and causal inference). Each study quantifies three variables:
Study 1: drug ← gender → heart attacks
Study 2: drug → blood pressure → heart attacks
The book asks how we know we should treat the middle variables in those studies differently. The examples come with identical numbers, so that a statistics program which only sees correlations, and can’t understand the causal arrows I’ve drawn here, would analyze both studies using the same methods. The numbers in these studies are chosen so that the aggregate data suggest an opposite conclusion about the drug from what we see if we stratify by gender or blood pressure. Standard statistics won’t tell us which way of looking at data is more informative. But if we apply a little extra knowledge, it becomes clear that gender was a confounding variable that should be controlled for (it influenced who decided to take the drug), whereas blood pressure was a mediator that tells us how the drug works, and shouldn’t be controlled for.
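To make the flip concrete, here is a minimal Python sketch. The counts and names are made up for illustration (they are not the book’s numbers): the drug looks better within each gender but worse in aggregate, and treating gender as a confounder, as in Study 1, the back-door adjustment recovers the stratified verdict.

```python
# A minimal sketch of Simpson's paradox with made-up counts
# (not the book's numbers). Format: (recovered, total) per subgroup.
data = {
    "women": {"drug": (81, 87),   "no_drug": (234, 270)},
    "men":   {"drug": (192, 263), "no_drug": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

# Stratified view: the drug looks better within each gender.
for gender, arms in data.items():
    for arm, (r, n) in arms.items():
        print(f"{gender:5s} {arm:7s}: {rate(r, n):.0%}")

# Aggregate view: the drug looks worse overall (the "paradox").
for arm in ("drug", "no_drug"):
    r = sum(data[g][arm][0] for g in data)
    n = sum(data[g][arm][1] for g in data)
    print(f"overall {arm:7s}: {rate(r, n):.0%}")

# Treating gender as a confounder (Study 1), the back-door adjustment
# averages the stratified rates, weighted by how common each gender is.
total_patients = sum(n for g in data for _, n in data[g].values())
for arm in ("drug", "no_drug"):
    adjusted = sum(
        rate(*data[g][arm]) * sum(n for _, n in data[g].values()) / total_patients
        for g in data
    )
    print(f"adjusted {arm:7s}: {adjusted:.0%}")
# If the middle variable is instead a mediator (Study 2, blood pressure),
# we would not adjust; the aggregate comparison answers "what does the
# drug do?"
```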
People typically don’t find it hard to distinguish between the hypothesis that a drug caused a change in blood pressure and the hypothesis that a drug changed patients’ reported gender. We all have a sufficiently sophisticated model of the world to assume the drug isn’t changing patients’ gender identity (i.e. we know that if that assumption were unexpectedly false, we’d hear about it).
Yet canned programs today are not designed to handle that, and it will be hard to fix programs so that they have the common sense needed to make those distinctions over a wide variety of domains.
Continuing Problems?
Pearl complains about scientists controlling for too many variables. The example described above helps explain why controlling for variables is often harmful when it’s not informed by a decent causal model. I have been mildly suspicious of the “controlling for more variables is better” attitude in the past, but this book clarified the problems well enough that I should be able to distinguish sensible from foolish attempts at controlling for variables.
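To illustrate the foolish case, here is a small simulation of my own (not from the book) where the middle variable is a mediator, as in Study 2. The variable names and coefficients are arbitrary: controlling for the mediator hides the total effect we actually care about.

```python
# A toy simulation (my own, not the book's) of why controlling for a
# mediator is harmful. Structure: X -> M -> Y, so the total effect of
# X on Y flows entirely through M.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)              # treatment (e.g. the drug)
m = 2.0 * x + rng.normal(size=n)    # mediator (e.g. blood pressure)
y = 1.5 * m + rng.normal(size=n)    # outcome; total effect of x on y = 3.0

def coef_of_x(target, *predictors):
    """OLS coefficient on the first predictor (x)."""
    X = np.column_stack([np.ones(len(target)), *predictors])
    return np.linalg.lstsq(X, target, rcond=None)[0][1]

print(coef_of_x(y, x))      # ~3.0: the total causal effect
print(coef_of_x(y, x, m))   # ~0.0: "controlling for" the mediator erases it
```

Adjusting for a genuine confounder (like gender in Study 1) does the opposite: it removes bias rather than creating it, which is why the causal role of the variable, not the statistics alone, has to decide.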
Controlling for confounders seems like an area where science still has a long way to go before it can live up to Pearl’s ideals.
There’s also some lingering high modernism affecting the status of RCTs relative to other ways of inferring causality.
A sufficiently well-run RCT can at least create the appearance that everything important has been quantified. Sampling errors can be reliably quantified. Then the experimenter can sweep any systematic bias under the rug, and declare that the hypothesis formation step lies outside of science, or maybe deny that hypotheses matter (maybe they’re just looking through all the evidence to see what pops out).
It looks to me like the peer review process still focuses too heavily on the easy-to-quantify and easy-to-verify steps in the scientific process (i.e. p-values). When RCTs aren’t done, researchers too often focus on risk factors and associations, which lets them equivocate about whether the research enlightens us about causality.
AI
The book points out that an AI will need to reason causally in order to reach human-level intelligence. It seems like that ought to be uncontroversial. I’m unsure whether it actually is uncontroversial.
But Pearl goes further, saying that the lack of causal reasoning in AIs has been “perhaps the biggest roadblock” to human-level intelligence.
I find that somewhat implausible. My intuition is that general-purpose causal inference won’t be valuable in AIs until those AIs have world-models which are at least as sophisticated as crows[4], and that when that level is reached, we’ll get rapid progress at incorporating causal inference into AI.
It’s true that AI research often focuses on data mining (blind empiricism / model-free approaches), at the expense of approaches that could include causal inference. High modernist attitudes may well have hurt AI research in the past, and that may still be slowing AI research a bit. But Pearl exaggerates these effects.
To the extent that Pearl identifies tasks that AI can’t yet tackle (e.g. “What kinds of solar systems are likely to harbor Earth-like planets?”), they need not just causal reasoning, but also the ability to integrate knowledge from a wide variety of data sources—and that means learning a much wider variety of concepts in a single system than AI researchers currently have the power to handle.
I expect that mainstream machine learning is mostly on track to handle that variety of concepts any decade now. I expect that until then, AI will only be able to do causal reasoning on toy problems, regardless of how well it understands causality.
Conclusion
Pearl is great at evaluating what constitutes clear thinking about causality. He’s somewhat good at teaching us how to think clearly about novel causal problems, and rather unremarkable when he ventures outside the realm of causal inference.
Footnotes
[1] - RCTs (and p-values) don’t seem to be popular in physics or geology. I’m curious why Pearl doesn’t find this worth noting. I’ve mentioned before that people seem to care about statistical significance mainly where powerful interest groups might benefit from false conclusions.
[2] - The book claims that an RCT for smoking “would be neither feasible nor ethical”. Clarke’s first law applies here: it looks like about 8 studies had some sort of randomized interventions which altered smoking rates, including two studies focused solely on smoking interventions, which generated important reductions in smoking relative to the control group.
The RCTs seem to confirm that smoking causes health problems such as lung cancer and cardiovascular disease, but suggest that smoking shortens lifespan by a good deal less than the correlations would indicate.
[3] - As footnote 2 suggests, there have been some legitimate puzzles about the effects of smoking. Those sources of uncertainty have been obscured by the people who signal support for the “smoking is evil” view, and by smokers and tobacco companies who cling to delusions.
Smokers probably have some unhealthy habits and/or genes that contribute to cancer via causal pathways other than smoking.
The book notes that there is a “smoking gene” (rs16969968, aka Mr Big), but mostly it just means that smoking causes more harm for people with that gene.
Yet the book mostly implies that the anti-smoking crusaders were at least 90% right about the effects of smoking, when I think the reality is more complicated.
Pearl thinks quite rigorously when he’s focused exclusively on causal inference, but outside that domain of expertise, he comes across as no more careful than an average scientist.
[4] - Pearl would have us believe that causal reasoning is mostly a recent human invention (in the last 50,000 years). I find Wikipedia’s description of non-human causal reasoning to be more credible.
I am confused by what you think. You write in a footnote that you agree with Wikipedia’s description, which says crows have causal reasoning. That would likely imply that AIs won’t get world-models as sophisticated as crows’ world models until they have causal reasoning.
I meant as sophisticated as crows in terms of basic pattern recognition, and the number, diversity, and generality of the concepts they can learn. Maybe that just means throwing more CPU power at existing ML approaches. Maybe that requires better ways of integrating a more diverse set of approaches into a single system.
Maybe I don’t have a clear enough meaning of “sophisticated” to be of much value here.
I’m a month late to the party, but wanted to chime in anyway.
Yes, there certainly seems to be a correlation. But I think in this case it can mostly be understood by treating the subject matter as a confounding variable: powerful interest groups mostly care about research that can influence policy-making, which usually means the social sciences. And it just so happens that designing a replicable study that measures exactly what you are looking for is a lot easier in physics/geology/astronomy than it is in social science (who knew, the brain is complicated). So social science gets stuck with proxies and replication crises and wiggle room for foul play. It is a relatively sensible reaction, then, to demand some amount of standardization (standard statistical tools, standard experimental methods, etc.) within the field. Or, put more bluntly: if you cannot have good faith in the truth of your peers’ research conclusions (mostly just because the subject is more difficult than any training prepared them for, not because your peers are evil), you need to artificially create some common ground to start repairing that sweet exponential growth curve.
What evidence makes you say p-values aren’t popular in physics? My passing (and mainly secondhand) understanding of cosmology is that it uses ~no RCTs but is very wedded to p-values (generally around the 5 sigma level).
You’re right. I was trying to summarize ideas from the book The Cult of Statistical Significance, but that book now looks slightly misleading, and my summary was more misleading.
There are some important ways in which physics rejected significant parts of Fisher’s ideas, but I guess I should describe them more as rejecting dogma than as rejecting p-values.
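For what it’s worth, here’s a quick sketch of mine (assuming the usual one-sided normal tail) of how the “5 sigma” convention mentioned above maps onto p-values:

```python
# Convert an n-sigma detection threshold to a one-sided p-value under
# a standard normal distribution.
from math import erfc, sqrt

def one_sided_p(n_sigma: float) -> float:
    # P(Z > n_sigma) for standard normal Z
    return 0.5 * erfc(n_sigma / sqrt(2))

print(one_sided_p(5))   # ~2.9e-7: the physics convention
print(one_sided_p(2))   # ~0.023 (two-sided ~0.046, near the familiar 0.05)
```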
I think part of the problem is that the scientific community lacks an effective empirical way to distinguish between different approaches to statistical reasoning.
As a result, a field like social neuroscience degenerated into being largely a cargo-cult science that manages to make predictions from brain scans better than should be theoretically possible. It achieves this feat by “predicting” the data that was used to train its models.
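As a minimal sketch of that failure mode (a toy example of mine, not actual social neuroscience data), a model fit to pure noise can “predict” its own training data almost perfectly while doing no better than chance on held-out data:

```python
# Toy demonstration of evaluating on training data vs. held-out data.
# The features are pure noise, so any apparent predictive power on the
# training set is overfitting, not a real pattern.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))     # 100 "scans", 200 noise features
y = rng.integers(0, 2, size=100)    # labels unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("accuracy on training data:", model.score(X_tr, y_tr))  # near 1.0
print("accuracy on held-out data:", model.score(X_te, y_te))  # near 0.5 (chance)
```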
When it’s as easy to publish with methods that can find patterns where there are none as with methods that require real patterns in the data, scientists in the field will be pressured in the cargo-cult direction.
To get good statistics we would actually need a new gold standard for evaluating whether people know something.
Despite Pearl’s early work on Bayesian networks, he doesn’t seem to be very familiar with Bayesian statistics—the above comment really only applies to frequentist statistics. Model construction and criticism (“scientific thinking”) is an important part of Bayesian statistics. Causal thinking is common in Bayesian statistics, because causal intuition provides the most effective guide for Bayesian model building.
I’ve worked implementing Bayesian models of consumer behavior for marketing research, and these are grounded in microeconomic theory, models of consumer decision making processes, common patterns of deviation from strictly rational choice, etc.
That quote is from a section on history, with the context implying that “as frequently practiced” is likely to refer to an average over the 20th century, not a description of 2018.