It is certainly true that you can get any results you want if you pick control variables based on how they change the effect estimates. This is a major problem with almost all observational research, and greatly reduces its reliability. People have suggested getting around it by requiring all investigators to publish their protocols before data collection begins, so that they have to specify at that time exactly what variables they will control for. This would be a good start, but it does not appear to be on the horizon.
The only way to obtain a valid estimate of a causal effect is to run a randomized controlled trial. However, this is not always feasible, and in situations where we really need information on causal effects but are unable to perform the experiment, the only thing we can do is to assume that nature ran a randomized controlled trial for us. This might have been a complicated trial where “nature” used one loaded die in men with high GPA, another loaded die in men with low GPA, a third die in women with high GPA, and a fourth die in women with low GPA, etc.
If the data came from such a trial, it will be possible to recover an estimate of the causal effect. In a simple situation, a linear regression model that controls for sex and GPA may be sufficient; but in most realistic settings, you will need models that are able to account for the fact that “exposure” varies with time. The important thing to note is that the estimate is only as valid as your belief in the untestable assumption that the data came from a randomized trial run by nature.
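To make the loaded-dice picture concrete, here is a minimal simulation sketch (all variable names, effect sizes, and treatment probabilities are made up for illustration): “nature” assigns treatment with a different biased coin in each sex-by-GPA stratum, the crude comparison is confounded, and adjusting for the stratum variables recovers the true effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical strata: sex and a binarized GPA
sex = rng.integers(0, 2, n)        # 0 = female, 1 = male
high_gpa = rng.integers(0, 2, n)   # 0 = low GPA, 1 = high GPA

# "Nature's loaded dice": a different treatment probability in each of the four strata
p_treat = 0.2 + 0.2 * sex + 0.4 * high_gpa
treated = rng.binomial(1, p_treat)

# Outcome depends on the strata (confounding) plus a true treatment effect of 2.0
outcome = 1.0 * sex + 3.0 * high_gpa + 2.0 * treated + rng.normal(0, 1, n)

df = pd.DataFrame(dict(sex=sex, high_gpa=high_gpa, treated=treated, outcome=outcome))

# The crude comparison is biased because high-GPA people are treated more often
crude = smf.ols("outcome ~ treated", data=df).fit().params["treated"]

# Adjusting for the variables that defined the "dice" recovers roughly 2.0
adjusted = smf.ols("outcome ~ treated + sex + high_gpa", data=df).fit().params["treated"]

print(f"crude estimate:    {crude:.2f}")     # noticeably above 2
print(f"adjusted estimate: {adjusted:.2f}")  # close to the true effect of 2
```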
What Pearl’s causal framework gives you is a very powerful language for reasoning about whether it is reasonable to interpret observational data as if it came from a randomized controlled trial run by nature. Pearl’s students will be able to tell you whether the framework has been applied correctly, but this is not sufficient to assess the trustworthiness of the result: For that, you need someone with subject matter expertise, i.e., someone who is willing to make untestable claims about whether the map (the DAG model) matches the territory (the data generating mechanism).
My main point is that you shouldn’t just automatically discount observational studies, but instead use DAG models to reason carefully about what level of confidence you assign to the investigators’ implicit claim that the data came from a natural experiment where exposure was assigned by loaded dice that differed between the groups defined by the control variables.
-
Edited to add that if you believe the main problem with observational studies is that investigators choose their set of control variables in order to get the results they want, one solution to this is to use Don Rubin’s propensity score matching method, which specifically avoids this problem. See this paper. The only problem with the propensity score method is that it does not generalize to situations where exposure varies with time; in fact, students who are trained in the Rubin Causal Model (which competes with Pearl’s graphical model) often become blind to a range of biases that arise because exposure varies with time.
So when Robin Hanson wants to know the real effect of health spending on health, he doesn’t look for correlational control-variables studies on the effect of health spending on health, because he knows those studies will return whatever the researchers want them to say. What Robin does instead is look for studies that happen to control for health care spending on the way to making some other point, and then look at what the correlation coefficient was in those studies, which aren’t putatively about healthcare; and according to Robin the coefficient is usually zero.
This is an example of clever data, obtained in spite of the researchers, which I might be inclined to trust—perhaps too much so, for its cleverness. But the scarier moral is that correlational studies are bad enough, and by the time you add in control variables, the researchers usually get whatever result they want. If you trust a result at all in a correlational study, it should be because you think the researchers weren’t thinking about that result at all and were unlikely to ‘optimize’ it by accident while they were optimizing the study outcome they were interested in.
But the scarier moral is that correlational studies are bad enough, and by the time you add in control variables, the researchers usually get whatever result they want.
Hmm.
Here’s my thinking about this in the context of the post.
If the presence of trait A precedes the presence of trait B, and there’s correlation between trait A and trait B, then this establishes a prior that trait A causes trait B. The strength of the prior depends (in some sense) on the number of traits correlated with trait A that precede the presence of trait B, and one updates from the prior based on the plausibility of causal pathways in each case.
In the case of college attended and earnings, we have two hypotheses (that constitute the bulk of the probabilistic effect sizes) as to the source of the correlation: (i) going to a more selective college increases earnings, and (ii) traits that get people into more selective colleges increase earnings.
To test for (ii), one controls for features that feed into college admissions. GPA and SAT scores are the easiest of these to obtain data on, but there are others, such as class rank, extracurricular activities, essays, whether one is a strong athlete, whether one’s parents are major donors to the college, etc. To pick up on some of these, the authors control for the average SAT score of the colleges that the student applied to, and the number of applications submitted, which together serve as a measure of the student’s confidence that he or she can get into selective colleges (the intuition being that if a student submits only a small number of applications and applies only to top colleges, he or she has confidence that he or she will get into one).
The question is then whether there are sufficiently many other metrics (with large publicly available data sets) of the characteristics that get students into college so that the authors could have cherry picked ones that move the correlation to be statistically indistinguishable from 0. Can you name five?
If the presence of trait A precedes the presence of trait B
You mean precedes in time? What if A is my paternal grandfather’s eye color (black), and B is my eye color (black)? Our eye color is correlated due to common ancestry, and A precedes B in time. But A does not cause B. There are lots of correlated things in the world due to a common cause, and generally one of them precedes another in time.
You can’t talk about correlation and time like that. I think the only thing we can say is that macroscopic retrocausation should probably be disallowed.
The way interventionists think about effects is that the effect of A on B in a person C is really about how B would change in a hypothetical person C’ who differs from C only in that we changed their A. It’s not about correlation, dependence, temporal order, or anything like that.
This approach might work sometimes, but I think it is problematic in most cases for the following reason:
Health care spending can only affect health through medical interventions (unless it is possible to extend someone’s life by signalling that you care enough to spend money on health care).
If the study is designed to estimate the effect of some medical intervention, that intervention will be in the regression model. If you want to interpret the coefficient for health care spending causally, you have a major problem in that the primary causal pathway has been blocked by conditioning on whether the patient got the intervention. In such situations, the coefficient of health care spending would be expected to be zero even if it has a causal effect through the intervention.
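A small simulation sketch of this point (hypothetical variable names and made-up numbers): spending affects health only through receiving the intervention, so its coefficient is essentially zero once the intervention is in the model, even though spending has a real causal effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data: spending affects health only through receiving the intervention
spending = rng.normal(0, 1, n)
intervention = rng.binomial(1, 1 / (1 + np.exp(-spending)))   # more spending, more interventions
health = 2.0 * intervention + rng.normal(0, 1, n)             # the entire effect runs through the mediator

df = pd.DataFrame(dict(spending=spending, intervention=intervention, health=health))

# The total effect of spending shows up when the mediator is left out of the model ...
total = smf.ols("health ~ spending", data=df).fit().params["spending"]

# ... but conditioning on the intervention blocks the only causal pathway
direct = smf.ols("health ~ spending + intervention", data=df).fit().params["spending"]

print(f"spending coefficient, intervention not in model: {total:.2f}")  # clearly positive
print(f"spending coefficient, intervention in model:     {direct:.2f}") # approximately zero
```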
The important thing to note is that the validity of the estimate is only as valid as your belief in the untestable assumption
Can I check that when you and others writing on statistics and causality talk about “untestable assumptions”, you mean assumptions not testable by the experiment under discussion? Presumably the assumptions are based on previously acquired knowledge which may well have been tested, and had better be capable of being tested by other possible experiments; it’s just that the present experiment is not capable of providing any further evidence about the assumption.
Good question! “Untestable assumption” can actually mean two different things:
In this context, you are correct to point out that I am talking about assumptions that are not testable by the data we are analyzing. I would be able to falsify my unconfoundedness assumption if I ran an experiment where I first observe what value the treatment variable would take naturally in all individuals, then intervene so that everyone is treated, and look at whether the distribution of the outcome differs between the group who naturally would have had treatment and the group who naturally would not have had treatment.
In other contexts, there are other types of untestable assumptions, which are unfalsifiable even in principle. These relate to independences between counterfactual variables from different “worlds”. Basically, they assume that certain columns in your ideal dataset are independent of each other, when it is impossible even in theory to observe those two columns in the whole population at the same time.
If you refuse to make assumptions of the second type, you will still be able to estimate the effect of any intervention that is identifiable in Pearl’s causal framework NPSEM, but you will not be able to analyze mediation or causal pathways. This is the difference between Pearl’s model NPSEM and Robins’ model FFRCISTG. The refusal to make unfalsifiable assumptions about independences between cross-world counterfactuals is also the primary motivation behind the “Single World Intervention Graph” paper by Robins and Richardson, which Ilya linked to in another comment in this thread.
Good post, thanks. FFRCISTG still assumes SUTVA, which is untestable (also, like any structural equation model, it assumes absent arrows represent absence of individual level effects, which seems like it is also untestable (?)).
. . . someone should write up an explanation and post it here. :)
I think I might be confused by the concept of testability. But with that out of the way:
no, we really mean “untestable.” SUTVA (stable unit treatment value assumption) is a two part assumption:
first, it assumes that if we give the treatment to one person, this does not affect other people in the study (unclear how to check for this...)
second, it assumes that if we observed that the exposure A is equal to a, then there is no difference between the observed responses for any person and the responses for that same person under a hypothetical randomized study where we assigned A to a for that person (unclear how to check for this either… it talks about hypothetical worlds).
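In symbols, one standard way of writing the two parts (the notation here is mine, not from the thread):

```latex
% Part 1 (no interference): unit i's potential outcome depends only on unit i's own treatment,
% not on the treatments given to anyone else in the study
Y_i(a_1, \dots, a_n) = Y_i(a_i)

% Part 2 (consistency): the response observed when A_i = a is the same as the response
% under a hypothetical study that assigned A_i = a
A_i = a \;\Longrightarrow\; Y_i = Y_i(a)
```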
Causal inference from observational data has to rely on untestable assumptions to link what we see with what would have happened under a hypothetical experiment. If you don’t like this, you should give up on causal inference from observational data (and you would be in good company if you do—Ronald Fisher and lots of other statisticians were extremely skeptical).
It’s not clear to me how large a class of statements you’re considering untestable. Are all counterfactual statements untestable (because they are about non-existent worlds)?
To take an example I just googled up, page 7 of this gives an example of a violation of the first of the SUTVA conditions. Is that violating circumstance, or its absence, untestable even outside of the particular study?
Another hypothetical example would be treatment of patients having a dangerous and infectious disease. One would presumably be keeping each one in isolation; is the belief that physical transmission of microorganisms from one person to another may result in interference between patient outcomes untestable? Surely not.
Such a general concept of untestability amounts to throwing up one’s hands and saying “what can we ever know?”, while looking around at the world shows that in fact we know a great deal. I cannot believe that this is what you are describing as untestable, but then it is not clear to me what the narrower bounds of the class are.
At the opposite extreme, some assumptions called untestable are described as “domain knowledge”, in which case they are as testable as any other piece of knowledge (where else does “domain knowledge” come from?), but merely fail to be testable by the data under present consideration.
It’s not clear to me how large a class of statements you’re considering untestable.
As I said, I am confused about the concept of testability. While I work out a general account I am happy with (or perhaps abandon ship in a Bayeswardly direction or something) I am relying on a folk conception to get statements that, regardless of what the ultimate account of testability might be, are definitely untestable. That is, we cannot imagine an effective procedure that would, even in principle, check if the statement is true.
The standard example is Smoking → Tar → Cancer
The statement “the random variables I have cancer given that I was _assigned_ to smoke and I have tar in my lungs given that I was _assigned_ not to smoke are independent” is untestable.
That’s because to test this independence, I have to simultaneously consider a world where I was assigned to smoke, and another world where I was assigned not to smoke, and consider a joint distribution over these two worlds. But we only can access one such world at a time, unless we can roll back time, or jump across Everett branches.
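In counterfactual notation, the statement is roughly the following (essentially the assumption that appears as equation (15) further down the thread, written there with the mediator argument made explicit):

```latex
% "my cancer status had I been assigned to smoke" is assumed independent of
% "my tar level had I been assigned not to smoke"
Y(a = 1) \;\perp\!\!\!\perp\; M(a = 0)
```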
Pearl does not concern himself with testability very much, because Pearl is a computer scientist, and to Pearl the world of Newtonian physics is like a computer circuit, where it is obvious that everything stays invariant to arbitrary counterfactual alterations of wires, and in particular sources of noise stay independent. But the applications of causal inference aren’t to circuits, but to much mushier problems—like psychology or medicine. In such domains it is not clear why assumptions like my example intuitively should hold, and not clear how to test them.
Such a general concept of untestability amounts to throwing up one’s hands and saying “what can we ever know?”
This is not naive skepticism, this is a careful account of assumptions, a very good habit among statisticians, in my opinion. We need more of this in statistical and causal analysis, not less.
The statement “the random variables I have cancer given that I was assigned to smoke and I have tar in my lungs given that I was assigned not to smoke are independent” is untestable.
Can you give me a larger context for that example? A pointer to a paper that uses it would be enough.
At the moment I’m not clear what the independence of these means, if they’re understood as statements about non-interacting world branches. What is the mathematical formulation of the assertion that they are independent? How, in mathematical terms, would that assumption play a role in the study of whether smoking causes cancer?
From another point of view, suppose that we knew the exact mechanisms whereby smoke, tar, and everything else have effects on the body leading to cancer. Would we then be able to calculate the truth or falsity of the assumption?
Since you asked for a paper, I have to cite myself:
http://arxiv.org/pdf/1205.0241v2.pdf
(there are lots of refs in there as well, for more reading).
The “branches” are interacting because they share the past, although I was being imprecise when I was talking about Everett branches—these hypothetical worlds are mathematical abstractions, and do not correspond directly to a part of the wave function at all. There is no developed extension of interventionist causality to quantum theory (nor is it clear whether this is a good idea—the intervention abstraction might not make sense in that setting).
Thanks, I now have a clearer idea of what these expressions mean and why they matter. You write on page 15:
Defining the influence of A on Y for a particular unit u as Y(1,M(0,u),u) involved a seemingly impossible hypothetical situation, where the treatment given to u was 0 for the purposes of the mediator M, and 1 for the purposes of the outcome Y.
For the A/M/Y = smoking/tar/cancer situation I can imagine a simple way of creating this situation: have someone smoke cigarettes with filters that remove all of the tar but nothing else. There may be practical engineering problems in creating such a filter, and ethical considerations in having experimental subjects smoke, but it does not seem impossible in principle. This intervention sets A to 1 and M to M(0,u), allowing the measurement of Y(1,M(0,u),u).
As with the case of the word “untestable”, I am wondering if “impossible” is here being understood to mean, not impossible in an absolute sense, but “impossible within some context of available means, assumed as part of the background”. For example, “impossible without specific domain knowledge”, or “impossible given only the causal diagram and some limited repertoire of feasible interventions and observations”. The tar filter scenario goes outside those bounds by using domain knowledge to devise a way of physically erasing the arrow from A to M.
I have the same question about page 18, where you say that equation (15):
Y(1,m) _||_ M(0)
is untestable (this is the example you expressed in words upthread), even though you have shown that it mathematically follows from any SEM of a certain form relating the variables, and could be violated if it has certain different forms. The true causal relationships, whatever they are, are observable physical processes. If we could observe them all, we would observe whether Y(1,m) _||_ M(0).
Again, by “untestable” do you here mean untestable within certain limits on what experiments can be done?
Richard, thanks for your message, and for reading my paper.
At the risk of giving you more homework, I thought I would point you to the following paper, which you might find interesting:
http://www.hsph.harvard.edu/james-robins/files/2013/03/wp100.pdf
This paper is about an argument the authors are having with Judea Pearl about whether assumptions like the one we are talking about are sensible to make. Of particular relevance for us is section 5.1. If I understood the point the authors are making, whenever Judea justifies such an assumption, he tells a story that is effectively interventional (very similar to your story about a filter). That is, what really is happening is we are replacing the graph:
A → M → Y, A → Y
by another graph:
A → A1 → Y, A → A2 → M → Y
where A1 is the “non tar-producing part” of smoking, and A2 is the “tar-producing part” of smoking (the example in 5.1 was talking about nicotine instead). As long as we can tell such a story, the relevant counterfactual is implemented via interventions, and all is well. That is, Y(A=1,M(A=0)) in graph 1 is the same thing as Y(A1=1,A2=0) in graph 2.
The true causal relationships, whatever they are, are observable physical processes. If we could observe them all, we would observe whether Y(1,m) _||_ M(0).
The point of doing mediation analysis in the first place is because we are being empiricist—using data for scientific discovery. In particular, we are trying to learn a fairly crude fact about cause-effect relationships of A, M and Y. If, as you say, we were able to observe the entire relevant DAG, and all biochemical events involved in the A → M → Y chain, then we would already be done, and would not need to do our analysis in the first place.
“Testability” (the concept I am confused about) comes up in the process of scientific work, which is crudely about expanding a lit circle of the known via sensible procedures. So intuitively, “testability” has to involve the resources of the lit circle itself, not of things in the darkness. This is because there is a danger of circularity otherwise.
My main point is that you shouldn’t just automatically discount observational studies, but instead use DAG models to reason carefully about what level of confidence you assign to the investigators’ implicit claim that the data came from a natural experiment where exposure was assigned by loaded dice that differed between the groups defined by the control variables.
I agree with this, but disagree with a number of other things you say here. I will add that sometimes the DAG is more complicated than observed exposure influenced by observed “baseline covariates” (what you call “control variables.”) Sometimes causes of observed exposure are unobserved and influence observed outcomes—but you can still get the causal effect (by a more complicated procedure than one that relies on variable adjustment).
in fact, students who are trained in the Rubin Causal Model (which competes with Pearl’s graphical model) often become blind to a range of biases that arise because exposure varies with time.
Rubin competes with Pearl. The Rubin causal model does not compete with Pearl’s graphical models; they are the same thing. Rubin just doesn’t understand/like graphs. See here:
http://www.csss.washington.edu/Papers/wp128.pdf
Is it actually true that people using Rubin’s model will not handle time-varying confounding properly? Robins expresses time-varying confounding problems using sequential ignorability, which I think ought to be quite simple to Rubin people.
one solution to this is to use Don Rubin’s propensity score matching method, which specifically avoids this problem
I think you are just wrong here. With all due respect to Don Rubin, propensity score matching has nothing to do with avoiding bias, it is an estimation method for a functional you get when you adjust for confounding:
p(Y(a)) = \sum_{c} p(Y | a, c) p(c)
if conditional ignorability (Y(a) independent of A given C) and SUTVA hold.
It is just one method of many for estimating the above functional, others being inverse probability weighting, the parametric g-formula, doubly robust methods, or whatever other ways people have invented. None of these estimation methods are going to avoid the issue if the functional above is not equal to the causal effect you want. Whether that is true has to do with whether conditional ignorability is true in your problem or not, not whether you use propensity score methods.
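For concreteness, here is a minimal sketch of two of these estimators on simulated data (the confounder, effect size, and sample size are made up). Both the plug-in adjustment formula and inverse probability weighting target the same functional, and both recover the effect here only because conditional ignorability holds by construction in the simulation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical data with a single binary baseline covariate C
c = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.3 + 0.4 * c)            # exposure depends on C
y = 1.0 * c + 2.0 * a + rng.normal(0, 1, n)   # true effect of A on Y is 2.0

df = pd.DataFrame(dict(c=c, a=a, y=y))

# Estimator 1: plug-in adjustment, p(Y(a)) = sum_c p(Y | a, c) p(c), using stratum means
def adjusted_mean(df, a_val):
    mean_by_c = df[df.a == a_val].groupby("c").y.mean()   # E[Y | a, c]
    p_c = df.c.value_counts(normalize=True)               # p(c)
    return (mean_by_c * p_c).sum()

ate_adjustment = adjusted_mean(df, 1) - adjusted_mean(df, 0)

# Estimator 2: inverse probability weighting with the stratum-level propensity score
ps = df.c.map(df.groupby("c").a.mean())                   # estimated p(A = 1 | C)
ey1 = np.average(df.y, weights=df.a / ps)
ey0 = np.average(df.y, weights=(1 - df.a) / (1 - ps))
ate_ipw = ey1 - ey0

print(f"adjustment estimate: {ate_adjustment:.2f}")   # about 2.0
print(f"IPW estimate:        {ate_ipw:.2f}")          # about 2.0: same target, different estimator
```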
I don’t disagree with anything you are saying: There is nothing in the propensity score estimation method itself that makes the results less prone to bias, compared to other methods.
I should have been more specific; my point was rather that the specific implementation of propensity score matching which Rubin recommends in the paper I linked to allows the investigator to blind themselves to the outcome while assessing whether they have been able to create balanced groups. It is possible that you can do something similar with other estimation methods, but I haven’t heard anyone talk much about it, and it is not immediately obvious to me how you would go about it.
Thanks for the quick clarification! I guess I am not following you. If you want to blind yourself, you can just do it—you don’t need to modify the estimator in any way, just write the computer program implementing your estimator in such a way that you don’t see the answer. This issue seems to be completely orthogonal to both causal inference and estimation. (???)
Am I missing something?
If you use the propensity score matching method, you begin by estimating the propensity score, then you match on the propensity score to create exposed and unexposed groups within levels of the propensity score. After you create those groups, there is a step where you can look at the matched groups without the outcome data, and assess whether you have achieved balance on the baseline covariates. If I understand Rubin’s students correctly, they see this as a major advantage of the estimation method.
You can obviously blind yourself to the outcome using any estimation method, but I am not sure if there is a step in the process where you look at the data without the outcome to evaluate how confident you are in your work.
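For concreteness, here is a rough sketch of that workflow on simulated data (nearest-neighbour matching with replacement stands in for whatever matching scheme Rubin actually recommends, and all names and numbers are made up). The point is that the balance check in step 3 happens before the outcome column is ever used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical observational data with two baseline covariates
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
a = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2))))
y = x1 + x2 + 1.5 * a + rng.normal(0, 1, n)   # outcome; set aside until the very end

df = pd.DataFrame(dict(x1=x1, x2=x2, a=a, y=y))

# Step 1: estimate the propensity score from covariates only (the outcome is not involved)
ps_model = LogisticRegression().fit(df[["x1", "x2"]], df.a)
df["ps"] = ps_model.predict_proba(df[["x1", "x2"]])[:, 1]

# Step 2: match each treated unit to the nearest control on the score (with replacement)
treated = df[df.a == 1]
controls = df[df.a == 0].sort_values("ps").reset_index(drop=True)
pos = np.searchsorted(controls.ps.values, treated.ps.values).clip(1, len(controls) - 1)
nearer_left = (treated.ps.values - controls.ps.values[pos - 1]
               < controls.ps.values[pos] - treated.ps.values)
matched_controls = controls.iloc[np.where(nearer_left, pos - 1, pos)]
matched = pd.concat([treated, matched_controls], ignore_index=True)

# Step 3: the design check: assess covariate balance in the matched sample
# while the outcome column is still untouched
def std_mean_diff(data, col):
    t, c = data.loc[data.a == 1, col], data.loc[data.a == 0, col]
    return (t.mean() - c.mean()) / np.sqrt((t.var() + c.var()) / 2)

for col in ["x1", "x2"]:
    print(col, "SMD before matching:", round(std_mean_diff(df, col), 2),
          "after matching:", round(std_mean_diff(matched, col), 2))

# Step 4: only once balance looks acceptable is the outcome brought in
effect = matched.loc[matched.a == 1, "y"].mean() - matched.loc[matched.a == 0, "y"].mean()
print("matched estimate of the treatment effect:", round(effect, 2))   # near the true 1.5
```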
In order for the propensity score method to give an unbiased estimate of the causal effect, the following is sufficient:
(a) SUTVA (this is untestable)
(b) Conditional ignorability (this is testable in principle, but only if we randomize the exposure A)
(c) The treatment assignment probability model (that is the model for p(A | C), where A is exposure and C is baseline covariates) must be correct.
It may be that the “balance property” tests a part of (b), but surely not all of it! That is, the arms might look balanced, but conditional ignorability might still not hold. We cannot test all the assumptions we need to draw causal conclusions from observational data using only observational data. Causal assumptions have to enter in somewhere!
I think I might be confused about why checking for balance without working out the effect is an advantage—but I will think about it, because I am not an expert on propensity score methods, so there is probably something I am missing.