Question: we want to estimate the causal effect of X on Y from observational data, and we observe some confounding variables. Which variables do we adjust for to get an unbiased estimate of the causal effect?
Rubin: All of them (we should condition on all available data, so we don’t waste information).
Pearl: those and only those which block back-door paths but not causal paths in the graph.
I think what is going on is there are two separate issues here. Pearl is talking about an identification issue—what functional represents causal effects in an unbiased way. Rubin is talking about an estimation issue—we should use all available information to reduce uncertainty in our estimate. Pearl is talking about bias, Rubin is talking about variance.
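To make "functional" concrete, here is the standard back-door adjustment formula, written as a sketch. The symbol Z is notation introduced here (it does not appear in the comment above) and stands for any set of observed variables that satisfies Pearl's back-door criterion relative to (X, Y):

```latex
p(y \mid do(x)) \;=\; \sum_{z} p(y \mid x, z)\, p(z)
```

Identification is the claim that this equality holds for a given choice of Z. Estimation is the separate question of how best to estimate the right-hand side from a finite sample.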
In my view, the “right answer” is that if we want the effect of X on Y, we have to both:
(a) Use all available information (the functional for the effect is a function of all variables that are ancestors of Y other than through X).
(b) Use all available information in the “right way” to avoid bias. That is, we don’t just want to condition on a particular ancestor of Y, we may have to do more complex things to avoid bias.
Here’s a paper we wrote that gives an unbiased maximum likelihood estimator for all identifiable causal effects in discrete models with hidden variables: http://arxiv.org/pdf/1202.3763.pdf. Because the estimate is an MLE it uses all information like Rubin wants. Because the estimate is unbiased, Pearl should be happy as well.
By the way, “M-bias” refers to a situation where we observe a variable that is correlated with both X and Y but is not an ancestor of Y other than through X. Simplest graph: X → Y <-> W <-> X (where <-> stands for a hidden common cause). In this case, the right thing to do is not to condition on W, or indeed to use W in any way, when estimating p(y | do(x)). The MLE for p(y | do(x)) does not use W, so we don’t lose information by ignoring W. So in this particular case, Pearl is right to worry about bias when conditioning on W, and Rubin is wrong to worry about missing information when not conditioning on W (there is no information to miss).
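The M-bias graph can be checked numerically. The sketch below is only an illustration under assumptions that are not in the comment above: a linear Gaussian model with unit coefficients, two hidden variables u1 and u2 standing in for the two <-> edges, and ordinary least squares rather than the MLE from the linked paper. The function name simulate_m_bias and all parameter names are hypothetical.

```python
import numpy as np


def simulate_m_bias(n=200_000, effect=1.0, seed=0):
    """Simulate X -> Y <-> W <-> X and compare two regression estimates.

    Assumed (illustrative) model: linear, Gaussian, unit coefficients.
    u1 and u2 are hidden common causes standing in for the <-> edges.
    Returns (slope of X ignoring W, slope of X adjusting for W).
    """
    rng = np.random.default_rng(seed)
    u1 = rng.normal(size=n)  # hidden common cause of X and W
    u2 = rng.normal(size=n)  # hidden common cause of W and Y
    w = u1 + u2 + rng.normal(size=n)
    x = u1 + rng.normal(size=n)
    y = effect * x + u2 + rng.normal(size=n)

    ones = np.ones(n)

    # Regress Y on X only: W is ignored entirely.
    design_no_w = np.column_stack([ones, x])
    slope_no_w = np.linalg.lstsq(design_no_w, y, rcond=None)[0][1]

    # Regress Y on X and W: conditioning on the collider W opens
    # the non-causal path X <- u1 -> W <- u2 -> Y.
    design_with_w = np.column_stack([ones, x, w])
    slope_with_w = np.linalg.lstsq(design_with_w, y, rcond=None)[0][1]

    return slope_no_w, slope_with_w


if __name__ == "__main__":
    no_w, with_w = simulate_m_bias()
    print(f"true effect:        1.000")
    print(f"ignoring W:         {no_w:.3f}")
    print(f"conditioning on W:  {with_w:.3f}")
```

Under these particular coefficients the population slope of X is 1.0 when W is ignored and 0.8 when W is included, so the adjusted estimate stays biased no matter how large the sample is.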
“All of them” obviously cannot be literally true, because for instance we don’t want to condition on the future of Y even if we observe it (the future is just the noisy sensor version of the present; it carries no extra information, just extra randomness).
From your description, it seems that Rubin wants to predict what happens in the world, and Pearl insists on asking and answering questions about what happens in the world in terms of causal language.
What’s the simplest prediction of what happens in the world that Pearl would claim Rubin cannot accurately make?
If there is no such limitation in Rubin’s approach, we’re just arguing about which notation is more convenient. My preference lies with the most general notation, with the least amount of special-case jargon, so I likely will be on Rubin’s side.
Pearl likes graphs, but graphs are just a mathematical aid. What he and Rubin are talking about is not “about” graphs. You can prove all the theorems without graphs. Both Pearl and Rubin are talking about potential outcomes (interventionist view of causality). Pearl uses a model which makes cross-world independence assumptions (Rubin probably does not, although I have not asked him. Of course Rubin loves “principal stratification” which as far as I understand is wildly untestable, so who really knows what he thinks. A lot of workers in the field do not like cross-world independences because they are not testable).
To the extent that Rubin wants to estimate potential outcome random variables from observational data, he HAS to agree with Pearl on pain of bias (i.e. garbage). In the example I gave, if Rubin insists on conditioning on W, he will get a garbage answer for the potential outcome Y(x). Identification of potential outcomes isn’t the kind of thing where you can have a difference of opinion. It’s like having an opinion on what 2 + 2 is.
“In the example I gave, if Rubin insists on conditioning on W, he will get a garbage answer for the potential outcome Y(x).”
From your description, you say that Rubin insists on conditioning on all available data, so that includes W. But that doesn’t mean he has to get garbage, that just means he needs the right conditional.
Let Jaynes’s notation do the work. The base problem seems to be:
You can assign probabilities using observational data to create P(X1...XN | Intervention=No). How do I use that model to assign P(X1...XN | Intervention=Yes)?
Do these guys have any case where they make different predictions of what will happen in an intervention? Or do they just dance around in their own languages and come up with the same predictions?
“From your description, you say that Rubin insists on conditioning on all available data, so that includes W. But that doesn’t mean he has to get garbage, that just means he needs the right conditional.”
The right expression for p(y | do(x)) in this example ignores W; that’s all there is to it. It’s not a notational issue.
“You can assign probabilities using observational data to create P(X1...XN | Intervention=No). How do I use that model to assign P(X1...XN | Intervention=Yes)?”
Good question! The answer is to use something called the consistency assumption (I think Pearl might call it “composition” in his book). This states, roughly, that Y(X) = Y. (That is, observing Y when there is no intervention is the same as observing Y when X is set by intervention to whatever value it would naturally attain.) This assumption is untestable, but to my knowledge every single paper in causal inference makes this assumption in some form. Without something like this assumption there is no link between the data we observe and the data after a hypothetical intervention.
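As a sketch of where consistency enters, take the simplest case, in which there is no confounding between X and Y, so that Y(x) is independent of X. That independence is an extra assumption stated here for illustration; it does hold in the graph X → Y <-> W <-> X from earlier in the thread. The interventional distribution then reduces to an observed conditional in two steps:

```latex
\begin{align*}
p(y \mid do(x)) &= p(Y(x) = y) \\
                &= p(Y(x) = y \mid X = x) && \text{(independence of } Y(x) \text{ and } X\text{)} \\
                &= p(Y = y \mid X = x)    && \text{(consistency: } X = x \Rightarrow Y(x) = Y\text{)}
\end{align*}
```

The last step is the only place where a potential outcome is replaced by an observed variable, which is why nothing can be said about interventions from observed data without some form of this assumption.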
I think the kinds of examples that are drastically biased given Rubin’s “condition on everything” policy are not very common in practical data analysis problems, but it’s certainly easy to construct them. While I have not asked him, I suspect that if I were to put a gun to Rubin’s head and give him the above example, he would admit to not adjusting for W (and then say that the situations in the example never happen in practice).
My view: M-bias is a special case of a more general issue where conditioning opens paths (due to how d-separation works in graphs). The way this issue manifests in practice is that people assume they observe all confounders, adjust for them, get an estimate, and call it a day. In practice, their assumption is wrong: adjusting for all observable confounders opens a bunch of non-causal paths due to the inevitable presence of hidden variables, and the estimate they get is biased for this reason. There is, however, some evidence that this bias is sometimes not very big (I think Sander Greenland did some work on this).