I will give a potted history of Pearl’s discovery as I understand it.
In the late 70s and early 80s, people wanted to deal with uncertainty in logic-based AI. The obvious tool is probability, but naively doing a Bayesian update to compute a posterior takes time exponential in the number of variables.
Pearl wanted to come up with a good data structure for doing computations over probability distributions in less-than-exponential time.
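To make the cost difference concrete, here is a minimal sketch (my own toy example, not Pearl's algorithm as published). It assumes a distribution over 12 binary variables that factorizes along a chain, with made-up numbers in `prior` and `transitions`. It computes the same marginal twice: once by summing the full joint over all 2^n assignments, and once by pushing a two-entry message down the chain.

```python
from itertools import product

def brute_force_marginal(prior, transitions):
    """P(X_n = 1), summing the joint over all 2^n assignments."""
    n = len(transitions) + 1
    total = 0.0
    for xs in product((0, 1), repeat=n):
        p = prior[xs[0]]
        for i, t in enumerate(transitions):
            p *= t[xs[i]][xs[i + 1]]  # P(X_{i+1} = xs[i+1] | X_i = xs[i])
        if xs[-1] == 1:
            total += p
    return total

def chain_marginal(prior, transitions):
    """Same quantity in O(n) steps, using the chain factorization."""
    msg = list(prior)  # msg[a] = P(X_i = a) for the current i
    for t in transitions:
        msg = [sum(msg[a] * t[a][b] for a in (0, 1)) for b in (0, 1)]
    return msg[1]

# Hypothetical numbers: prior over X_1, and the same transition table reused
# for each of the 11 links, giving a chain of 12 variables.
prior = [0.7, 0.3]
transitions = [[[0.9, 0.1], [0.2, 0.8]]] * 11

slow = brute_force_marginal(prior, transitions)  # 4096 terms
fast = chain_marginal(prior, transitions)        # 11 small updates
print(slow, fast)
```

Both calls should agree up to floating-point error. The point is only that the factorized representation lets you avoid ever touching the full joint table.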
He introduced the idea of Bayesian networks in his paper “Reverend Bayes on Inference Engines”, where he represented factorized probability distributions using DAGs. Here the direction of the arrows is arbitrary, and there are many DAGs corresponding to one probability distribution.
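A small sketch of the "many DAGs, one distribution" point, using a made-up joint over two binary variables A and B (the numbers are purely illustrative). The same table can be stored as P(A) and P(B | A), which is the DAG A -> B, or as P(B) and P(A | B), which is the DAG B -> A. Both rebuild the identical joint.

```python
# Hypothetical joint distribution P(A, B), keyed by (a, b).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(joint, index):
    """Marginal distribution of the variable at position `index` in the key."""
    out = {0: 0.0, 1: 0.0}
    for assignment, p in joint.items():
        out[assignment[index]] += p
    return out

p_a = marginal(joint, 0)
p_b = marginal(joint, 1)

# DAG 1 (A -> B): store P(A) and P(B | A).
p_b_given_a = {(a, b): joint[(a, b)] / p_a[a] for a, b in joint}
# DAG 2 (B -> A): store P(B) and P(A | B).
p_a_given_b = {(a, b): joint[(a, b)] / p_b[b] for a, b in joint}

# Multiply each factorization back out.
rebuilt_a_to_b = {(a, b): p_a[a] * p_b_given_a[(a, b)] for a, b in joint}
rebuilt_b_to_a = {(a, b): p_b[b] * p_a_given_b[(a, b)] for a, b in joint}

for key in joint:
    assert abs(rebuilt_a_to_b[key] - joint[key]) < 1e-12
    assert abs(rebuilt_b_to_a[key] - joint[key]) < 1e-12
```

Nothing in the probabilities prefers one arrow direction over the other, which is why the direction only becomes meaningful once you add a causal reading.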
He was not thinking about causality at all; it was just a problem in data structures. The idea was that this would be used for the same sort of thing as an “expert system” or other logic-based AI systems, but taking into account uncertainty expressed probabilistically.
Later, people including Pearl noticed that you can, and often should, interpret the arrows as causal; this amounts to choosing one DAG from many. The fact that there are many possible DAGs is related to the fact that, absent additional assumptions about the world, there seem always to be multiple incompatible causal stories that explain the same observations. But if you pick one, you can start using it to see whether your causal question can be answered from observational data alone.
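As an illustration of answering a causal question from observational data once a DAG is chosen, here is a sketch under assumptions that are entirely mine: the DAG is Z -> X, Z -> Y, X -> Y with Z observed, and all the probabilities are invented. Under that DAG the standard back-door adjustment P(y | do(x)) = sum_z P(y | x, z) P(z) applies. The code builds the observational joint and then uses only that joint to compare the naive contrast with the adjusted one.

```python
from itertools import product

# Hypothetical mechanisms used only to generate the observational joint.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}

def p_y1_given_xz(x, z):
    return 0.1 + 0.3 * x + 0.5 * z  # X raises P(Y=1) by 0.3 at every z

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Observational joint P(z, x, y).
joint = {
    (z, x, y): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_xz(x, z), y)
    for z, x, y in product((0, 1), repeat=3)
}

def prob(pred):
    """Probability of the event described by pred(z, x, y)."""
    return sum(p for key, p in joint.items() if pred(*key))

def p_y1_given_x(x):
    """Plain conditioning: P(Y=1 | X=x)."""
    return prob(lambda z, xx, y: xx == x and y == 1) / prob(lambda z, xx, y: xx == x)

def p_y1_do_x(x):
    """Back-door adjustment over Z: sum_z P(Y=1 | x, z) P(z)."""
    total = 0.0
    for z in (0, 1):
        p_z_obs = prob(lambda zz, xx, y: zz == z)
        p_y1_xz = (prob(lambda zz, xx, y: zz == z and xx == x and y == 1)
                   / prob(lambda zz, xx, y: zz == z and xx == x))
        total += p_y1_xz * p_z_obs
    return total

naive = p_y1_given_x(1) - p_y1_given_x(0)
adjusted = p_y1_do_x(1) - p_y1_do_x(0)
print(naive, adjusted)
```

With these numbers the naive contrast overstates the effect because Z drives both X and Y, while the adjusted contrast recovers the 0.3 that was built into the hypothetical mechanism. A different choice of DAG would license a different formula, or none at all.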
Finally, he realized that the assumptions encoded in a DAG aren’t sufficient for fully general counterfactuals: in full generality you have to specify exactly what functional relationship goes along with each edge of the graph, that is, how each variable is computed from its parents.
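A sketch of why the functions matter, using a toy pair of models of my own construction (the names `y_model_a` and `y_model_b` are hypothetical). X is a fair coin, U is an independent fair coin, and Y is computed from X and U. The two models give the same observational distribution and the same P(Y | do(X)), yet they disagree about the counterfactual "given that we saw X=1 and Y=1, what would Y have been had X been 0?"

```python
from itertools import product

def y_model_a(x, u):
    return u        # X has no effect on Y for any individual

def y_model_b(x, u):
    return x ^ u    # X flips Y for every individual

def observational(f_y):
    """Joint P(X, Y) with X and U independent fair coins."""
    joint = {}
    for x, u in product((0, 1), repeat=2):
        key = (x, f_y(x, u))
        joint[key] = joint.get(key, 0.0) + 0.25
    return joint

def interventional(f_y, x):
    """P(Y=1 | do(X=x)), averaging over U."""
    return sum(0.5 * f_y(x, u) for u in (0, 1))

def counterfactual(f_y, x_obs, y_obs, x_new):
    """P(Y would be 1 under X=x_new | observed X=x_obs, Y=y_obs).

    Keep the values of U consistent with the observation (U is uniform,
    so the posterior over them is uniform), then re-run the function.
    """
    consistent = [u for u in (0, 1) if f_y(x_obs, u) == y_obs]
    return sum(f_y(x_new, u) for u in consistent) / len(consistent)

for name, f_y in (("A", y_model_a), ("B", y_model_b)):
    print(name, observational(f_y), interventional(f_y, 0),
          interventional(f_y, 1), counterfactual(f_y, 1, 1, 0))
```

No amount of observational or experimental data on X and Y separates these two models; only the functional form does.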
As someone originally concerned with AI, not with problems in the natural sciences, Pearl is probably unusual. Pearl himself looks back on Sewall Wright, who was working in genetics, as his progenitor for coming up with path diagrams. If you are interested in this, you should also look at Don Rubin’s experience; his causal framework is isomorphic to Pearl’s. Rubin was a 100 percent classic statistician, motivated by looking at medical studies.
I think another important part of Pearl’s journey was that during his transition from Bayesian networks to causal inference, he was very frustrated with the correlational turn in early 1900s statistics. Because causality is so philosophically fraught and often intractable, statisticians shifted to regressions and other acausal models. Pearl sees that as throwing out the baby (important causal questions and answers) with the bathwater (messy empirics and a lack of mathematical language for causality, which is why he coined the do operator).
Pearl discusses this at length in The Book of Why, particularly the Chapter 2 sections on “Galton and the Abandoned Quest” and “Pearson: The Wrath of the Zealot.” My guess is that Pearl’s frustration with statisticians’ focus on correlation was immediate upon getting to know the field, but I don’t think he’s publicly said how his frustration began.
Is Rubin’s work actually the same as Pearl’s??
Please tell me more.
That’s not the impression I got from reading Pearl’s Causality. If so, it seems like a major omission of scholarship.
Rubin’s framework says, basically: suppose all our observations are in a big data table. Now consider the counterfactual observations that didn’t happen (e.g. people in the control group getting the treatment). These are called “potential outcomes”; treat them like missing cells in the data table. Then causal inference is just filling in the potential outcomes using missing-data imputation techniques, although to be valid these require some assumptions about conditional independence.
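Here is a minimal sketch of that table view, on simulated data of my own invention: every unit has a hidden pair (y0, y1) with a constant treatment effect of 2.0, treatment is assigned by a coin flip, and only one of the two cells is ever observed. Because assignment is random, the crudest imputation (fill each missing cell with the mean of the group where that cell was observed) reduces to a difference in group means.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Each row is one unit; None marks the potential outcome we never observe.
rows = []
for _ in range(10_000):
    y0 = rng.gauss(0.0, 1.0)   # outcome without treatment (hypothetical)
    y1 = y0 + 2.0              # outcome with treatment; true effect is 2.0
    treated = rng.random() < 0.5
    rows.append({
        "treated": treated,
        "y0": None if treated else y0,
        "y1": y1 if treated else None,
    })

# Mean of each column over the cells that were actually observed.
observed_y1_mean = mean(r["y1"] for r in rows if r["treated"])
observed_y0_mean = mean(r["y0"] for r in rows if not r["treated"])

# Imputing every missing cell with its column mean and averaging y1 - y0
# over all rows gives exactly this difference.
ate_estimate = observed_y1_mean - observed_y0_mean
print(ate_estimate)
```

The estimate should land close to 2.0. Without randomization this simple imputation is not justified, and that is where the conditional independence assumptions come in.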
Pearl’s framework and Rubin’s are isomorphic in the sense that any set of causal assumptions in Pearl’s framework (a structural causal model, which has a DAG structure) can be translated into a set of causal assumptions in Rubin’s framework (a bunch of conditional independence assumptions about potential outcomes), and vice versa. This is touched on somewhat in Ch. 7 of “Causality”.
Pearl argues that despite this equivalence, his framework is superior because it’s a better tool for thinking. In other words, writing down your assumptions as a DAG/SCM is intuitive and can be explained and argued about, while he claims the Rubin model’s independence assumptions are opaque and hard to understand.
Some reading on this:
https://csss.uw.edu/files/working-papers/2013/wp128.pdf
http://proceedings.mlr.press/v89/malinsky19b/malinsky19b.pdf
https://arxiv.org/pdf/2008.06017.pdf
---
In my experience it pays to learn how to think about causal inference like Pearl (graphs, structural equations), and also how to think about causal inference like Rubin (random variables, missing data). Some insights only arise from a synthesis of those two views.
Pearl is a giant in the field, but it is worth remembering that he’s unusual in another way (compared to a typical causal inference researcher): he generally doesn’t worry about actually analyzing data.
---
By the way, Gauss not only arrived at the normal distribution while trying to track down Ceres’ orbit, he also developed the least squares method! So arguably the entire loss minimization framework in machine learning came about from thinking about celestial bodies.
Aha, I will have to ponder on this for a while. Thanks a lot!