Effect heterogeneity and external validity in medicine
Our paper “Effect heterogeneity and variable selection for standardizing causal effects to a target population” has just been published in the European Journal of Epidemiology at https://link.springer.com/article/10.1007/s10654-019-00571-w . While the journal’s version of record is behind a paywall, a preprint is available on arXiv at https://arxiv.org/pdf/1610.00068.pdf.
This paper argues for my deeply held belief that we can make significant advances in quantitative reasoning for medical decision making by thinking more carefully about effect heterogeneity and how it relates to the choice of effect scale.
Over the course of the last 7 years, external validity and generalizability have become increasingly hot topics in statistical methodology and computer science. In particular, a lot of progress has been made by Judea Pearl and Elias Bareinboim, who introduced a framework based on causal diagrams that can be used to reason about how to take causal information from one setting (for example: a randomized trial) and apply it in a different setting (for example: a clinically relevant target population).
The key questions of interest are: How do we know whether such extrapolation is even possible? How do we determine what information we need from the study, and what information we need from the target population, in order to extrapolate the findings? How do we put this information together in order to obtain a valid prediction for what happens if the intervention is implemented in the target population?
Pearl and Bareinboim’s framework for answering these questions is, of course, mathematically valid. However, in my opinion, their approach also throws the baby out with the bathwater. In particular, rather than extrapolating the magnitude of the effect (i.e. a measure of the “size” of the difference between what happens if the drug is taken and what happens if it is not taken), their approach takes the distribution of outcomes among the people who were assigned to receive the intervention in the study, and extrapolates that distribution to the target population without any reference to how it differs from what happened among the people who were randomized to the control condition.
Theoretically, this approach will work if the extrapolation procedure can account for every cause of the outcome whose distribution differs between the people in the study and the people you are trying to make predictions about. However, since the set of causes of the outcome is very large, it is rarely possible to measure all of them; indeed, we often do not even know what all the causes are. Our inferences therefore become subject to uncertainty arising from the auxiliary assumption that we know what to control for.
Consider a situation where scientists have conducted a randomized controlled trial in men, on the effects of homeopathy on heart disease. The scientists find that homeopathy has no effects in men, and wonder whether this finding can be extrapolated to women. If the scientists attempt to answer this question using Bareinboim and Pearl’s framework, they will be forced to conclude that no extrapolation can be made, unless they are willing to claim that they know all the causes of heart disease that differ between men and women, and have been able to measure every one of these causes in all the patients in the study.
In contrast, we suggest that scientists who want to extrapolate their findings and make predictions outside of the study should attempt to quantify the size of the effect: that is, by how much the outcomes of the people who were randomized to receive the intervention differ from the outcomes of the people who were randomized to the control condition. This effect size could then serve as the basis for extrapolation. Such an approach corresponds much more closely to how external validity and extrapolation have traditionally been understood in the medical literature.
In the real world of clinical medicine, doctors are usually given information about the effects of a drug on the risk ratio scale (the probability of the outcome if treated, divided by the probability of the outcome if untreated). With information on the risk ratio, a doctor can make a prediction for what will happen to the patient if treated, by multiplying the risk ratio by the patient’s risk if untreated (which is predicted informally from observable markers of the patient’s condition).
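As a worked illustration (with hypothetical numbers chosen purely for concreteness): if the reported risk ratio is 2 and the doctor judges the patient’s untreated risk to be 3%, the predicted risk under treatment is

$$\Pr(\text{outcome} \mid \text{treated}) \approx RR \times \Pr(\text{outcome} \mid \text{untreated}) = 2 \times 3\% = 6\%.$$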
The problem with this approach is that there are multiple scales on which to quantify the magnitude of the effect. Other possible scales for measuring effects include (defined formally below):
The odds ratio, which replaces the risk p with the odds p/(1-p)
The survival ratio, which uses the probability of survival (1-p) instead of the probability of death (p)
The risk difference (which uses an additive scale instead of a multiplicative one)
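To fix notation for the rest of the post (my notation, nothing official): let $p_1$ be the risk of the outcome under treatment and $p_0$ the risk under no treatment. The four scales are then

$$RR = \frac{p_1}{p_0}, \qquad OR = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}, \qquad SR = \frac{1-p_1}{1-p_0}, \qquad RD = p_1 - p_0.$$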
Unless the intervention has no effect, the empirical predictions will not be invariant to the choice of scale. This is, of course, a serious problem for principled clinical decision-making, but as we will show, it is not necessarily an impossible one.
Despite the scale dependence of the reasoning procedure, the risk ratio is in many cases the only summary of the effect size that is made available to clinicians, whether they get their information from journals, clinical guidelines or online resources for clinical information. Given that the reasoning procedure is not scale-invariant, the universal reliance on the risk ratio may plausibly lead to suboptimal medical decision making in a wide range of clinical scenarios. But, in contrast to the implications of the Bareinboim/Pearl framework, we argue that this does not necessarily mean that we should throw out reliance on parametric effect measures altogether.
Our suggestion for how to choose the scale has been discussed earlier on Less Wrong (see https://www.lesswrong.com/posts/K3d93AfFE5owfpkx4/counterfactual-outcome-state-transition-parameters ). I am not going to repeat the argument in full here, but I will ask you to consider the following highly stylized thought experiment, which illustrates the underlying intuition:
Consider a randomized controlled trial, conducted in Russia, where the intervention is playing Russian roulette once a year. It is found that among those who did not play Russian roulette, 1% of people died over the course of the year; among the people who played Russian roulette, 17.5% died. We want to extrapolate these findings to Norway, where nobody ordinarily plays Russian roulette and it is known that 0.5% of people die during any year. Our goal is to find out what happens in Norway if everyone takes up playing Russian roulette once a year.
Bareinboim and Pearl would suggest taking the risk of death among those who played Russian roulette (17.5%), controlling for all causes of death whose distribution differs between Russia and Norway, and producing an estimate for what happens in Norway if everyone plays Russian roulette. However, given the considerable differences between Russia and Norway in predictors of mortality, this is clearly not feasible in this situation.
If we instead attempt to quantify the effect size in Russia, this can be done on any of the previously discussed scales:
The risk ratio is 0.175 / 0.01 = 17.5
The risk difference is 0.175 − 0.01 = 0.165 (i.e. 16.5 percentage points)
The survival ratio is (1 − 0.175) / (1 − 0.01) = 0.825/0.99 = 5/6 ≈ 0.83
The odds ratio is (0.175/0.825) / (0.01/0.99) = 21
Each of these scales will result in a different prediction for what will happen if people in Norway play Russian roulette:
If we use the risk ratio, we will predict that 17.5 × 0.5% = 8.75% will die.
If we use the risk difference, we will predict that 0.5% + 16.5% = 17% will die.
If we use the survival ratio, we will predict that 5/6 × 99.5% ≈ 82.9% will survive, meaning that 17.1% will die.
If we use the odds ratio, we will predict that approximately 9.5% will die (the Norwegian baseline odds of 0.005/0.995, multiplied by 21 and converted back to a risk; all four calculations are reproduced in the sketch below).
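For readers who want to check the arithmetic, here is a minimal Python sketch (mine, not from the paper) that reproduces all four predictions:

```python
# Stylized numbers from the thought experiment (not real data):
p0_rus, p1_rus = 0.01, 0.175   # risk of death in Russia: without / with roulette
p0_nor = 0.005                 # baseline risk of death in Norway

def odds(p):
    return p / (1 - p)

rr = p1_rus / p0_rus                     # risk ratio      = 17.5
rd = p1_rus - p0_rus                     # risk difference = 0.165
sr = (1 - p1_rus) / (1 - p0_rus)         # survival ratio  = 5/6
or_ = odds(p1_rus) / odds(p0_rus)        # odds ratio      = 21.0

# Predicted risk of death in Norway if everyone plays, under each scale:
predictions = {
    "risk ratio":      rr * p0_nor,                                   # 8.75% die
    "risk difference": p0_nor + rd,                                   # 17.00% die
    "survival ratio":  1 - sr * (1 - p0_nor),                         # 17.08% die
    "odds ratio":      odds(p0_nor) * or_ / (1 + odds(p0_nor) * or_), # 9.55% die
}
for scale, risk in predictions.items():
    print(f"{scale}: {risk:.2%} of Norwegians predicted to die")
```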
These predictions differ massively not only in their implications for decision-making but also in their plausibility: Given what we know about Russian roulette, we would expect to see results much closer to 17% than to 9%. So clearly, some of these scales are doing something “right” and other scales are doing something “wrong”.
We argue that the key to understanding the implications of this scale-dependence is that only the survival ratio (5/6) has a structural meaning: it represents the proportion of empty chambers in the revolver, and therefore produces appropriate, valid predictions. In contrast, the risk ratio (17.5) has no possible structural meaning and therefore produces nonsense results.
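To spell out the structural claim: if the state of the revolver is independent of a person’s other causes of death, then

$$\Pr(\text{survive} \mid \text{play}) = \Pr(\text{empty chamber}) \times \Pr(\text{survive other causes}) = \frac{5}{6} \times \Pr(\text{survive} \mid \text{don't play}),$$

so the survival ratio equals 5/6 in any population where this independence holds, regardless of baseline mortality.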
Any attempt at extrapolation would, of course, have to account for all factors that determine the magnitude of the effect. For example, if Russians are more likely to be drunk when they play Russian roulette, they may be more likely to miss than Norwegians. This may lead to local deviations from effect sizes of 5/6, which will have implications for extrapolation. But once you have controlled for all of the factors that determine the magnitude of the effect on a scale that has structural meaning, extrapolation may be valid.
Crucially, we argue that controlling for all determinants of effect size (alcohol? how many chambers are there in typical revolvers in each country?) is much more tractable than controlling for all causes of mortality differences between the countries.
The main idea behind my research agenda is to explore how far we can push this argument in more clinically relevant settings. Next, consider a doctor who is trying to determine the pros and cons of treating a patient with a new drug. Suppose a reliable study on the drug shows that among those who received a placebo, 1% got an allergic reaction over the following 12 months; whereas, among those who received the drug, 2% got an allergic reaction.
The scientists behind the study can either tell the doctor that the risk ratio is 0.02/0.01 = 2, or that the survival ratio is 0.98/0.99 ≈ 0.99. Both statements are correct, but only the latter has a potential structural interpretation, since it plausibly corresponds to a state of nature where 99% of the population do not have the factors (genes?) that predispose a person to have an allergic reaction if exposed to the drug.
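To make the structural interpretation explicit (under the simplifying assumption that drug-induced reactions and background reactions occur independently): if a fraction $q$ of the population carries the predisposing factors, then the survival probabilities satisfy

$$1 - p_1 = (1 - q)(1 - p_0), \qquad \text{so} \qquad 1 - q = \frac{1 - p_1}{1 - p_0} = \frac{0.98}{0.99} \approx 0.99.$$

On this reading, the survival ratio is simply the proportion of people who lack the predisposing factors.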
Now consider that this patient also has a severe peanut allergy (which is unrelated to the medical issues that the doctor is treating them for) and lives in an environment where everyone eats peanuts all the time. This patient, therefore, has a 10% baseline risk of getting an allergic reaction over the course of 12 months, even in the absence of treatment with the new drug.
It would be insanity for the doctor to expect that the risk ratio from the study generalizes, and that the patient will have a 2 × 10% = 20% risk of an allergic reaction if given the new drug. In contrast, it may be meaningful to predict that their risk under treatment is given by 1 − (0.98/0.99 × 90%) ≈ 10.9%. This will correspond closely to what one might expect if the patient belongs to a population with the same distribution of factors that predispose to the specific drug-related allergic reaction as the population studied in the trial.
For these reasons, I consider it crucial that medical scientists put significant effort into reasoning about whether an effect measure has a plausible structural meaning in the context of their research question, before deciding to use it as the summary of their findings that is offered for use in clinical decision making.
If anyone can spot a flaw in our argument, that feedback would be invaluable. I invoke Crocker’s Rules for all responses to the paper and the post. I would very much appreciate it if this blog post and the paper could be forwarded to anyone who is in a position to evaluate their importance.
Finally, let me note that this paper is the first peer-reviewed academic publication to acknowledge support from the EA Hotel Blackpool in its funding section. The EA Hotel is a project worth supporting; see https://forum.effectivealtruism.org/posts/uyvc6p99vsWFMPZiz/ea-hotel-fundraiser-5-out-of-runway
It seems like the “causal diagrams for transportability” section in the paper is the obviously-correct way to handle the problem. Why do we need the rest of the paper at all?
I am curious why you think the approach based on causal diagrams is obviously correct. Would you be able to unpack this for me?
Does it not bother you that this approach fails to find a solution (i.e. won’t make any predictions at all) if there are unmeasured causes of the outcome, even if treatment has no effect?
Does it not bother you that it fails to find a solution to the Russian roulette example, because the approach insists on treating “what happens if treated” and “what happens if untreated” as separate problems, and therefore fails to make use of information about how much the outcome differs between the two treatment options?
Does it not seem useful to have an alternative approach that is able to make use of all the intuition which says we should be able to make such extrapolations? An alternative approach that formalizes the intuition that led all the pre-Pearl literature to consider the problem in terms of the magnitude of the effect, rather than in terms of the individual counterfactual distributions?
The key issue is that we’re asking a counterfactual question. The question itself will be underdefined without the context of a causal model. The Russian roulette hypothetical is a good example: “Our goal is to find out what happens in Norway if everyone took up playing Russian roulette once a year”. What does this actually mean? Are we asking what would happen if some mad dictator forced everyone to play Russian roulette? Or if some Russian roulette social media craze caught on? Or if people became suicidal en-masse and Russian roulette became popular accordingly? These are different counterfactuals, and the answer will be different depending on which of these we’re talking about. We need the machinery of counterfactuals—and therefore the machinery of causal models—in order to define what we mean at all by “what happens in Norway if everyone took up playing Russian roulette once a year”. That counterfactual only makes sense at all in the context of a causal model, and is underdefined otherwise.
I assume by “unmeasured causes” you mean latent variables—i.e. variables in the causal graph which happened to not be observed. A causal diagram framework can handle latent variables just fine; there is no fundamental reason why every variable needs to be measured. Latent variables are a pain computationally, but they pose no fundamental problem mathematically. Indeed, much of machine learning consists of causal models with latent variables.
Whether the treatment has an effect does not seem relevant here at all.
No. My intuition very strongly says that 100% of the relevant structural information/model can be directly captured by causal models, and that you’re just not used to encoding these sorts of intuitions into causal models. Indeed, counterfactuals are needed even to define what we mean, as in the Russian roulette example. The individual counterfactual distributions really are the thing we care about, and everything else is relevant only insofar as it approximates those counterfactual distributions in some situations.
Overall, my impression is that you don’t actually understand how to build causal models, and you are very confused about their applicability and limitations.
(Side note: I know you invoked Crocker’s rules, but I still feel kinda bad being this harsh, so… I definitely get the impression that you guys are smart folks. I think you’re misunderstanding causal models pretty severely, and I think you’re going to kick yourselves once you understand it better, but you clearly have the intelligence and will to do this sort of work. This was just an unlucky whiff.)
Hi Anders and John, we just posted for discussion (here on LessWrong https://www.lesswrong.com/posts/pbuDzeqHSXkkAdWY7/ ) a note addressing Anders’ concerns (alternatively, direct link for the note here). Would love to hear your thoughts.
Best, Carlos
I absolutely agree that this is a counterfactual question. I am using the machinery of counterfactuals and causal models, just a different causal model from the one you and Pearl prefer. In this case, I had in mind a situation that is roughly equivalent to a mad dictator forcing everyone to play Russian roulette, but the underspecified details are not all that important to the argument I am making.
This is straight up wrong, and on this particular point the causal inference establishment is on my side, not yours. For example, if there are backdoor paths that cannot be closed without conditioning on a latent variable, then the causal effect is not identified and there is no amount of computation that can get around this.
Much of machine learning gets causality wrong.
It is relevant because it allows me to construct a very simple scenario where we have very strong intuition that extrapolation should work; yet Pearl’s selection diagram fails to make a prediction for the target population.
I agree that you can encode all structural information in causal models. I do not agree that all structural information can be encoded in DAGs, which are one particular type of causal model. There are several kinds of background information about the causal structure that are essential for identifiability and that cannot be encoded on standard DAGs. For example, monotonicity is necessary for instrumental variable identification.
I am arguing that there is a special type of background information that is crucial for generalizability, and which cannot be encoded in Pearl/Bareinboim’s causal diagrams for transportability. I therefore proposed a non-DAG causal model which is able to use this background structural knowledge. The Russian roulette example is an attempt to illustrate the nature of this class of background knowledge.
This does not mean that it is impossible to make an extension of the causal DAG framework to encode the same information. I am just arguing that this is not what the Pearl/Bareinboim selection diagram framework does.
I did specifically invoke Crocker’s Rules, so I’d like to thank you for this feedback.
Of course, I think you are wrong about this. I dislike appeals to authority, but I would like to point out that I have a doctoral degree in epidemiologic methodology from Harvard, and that my thesis advisors were genuine thought leaders in causal modelling. I also want to point out that both my papers on this topic have been reviewed by editors and peer-reviewers with a deep understanding of causal models.
Of course, this does not necessarily mean that you are wrong. It does, however, mean that I think you should adjust your priors and truly try to understand my argument before you reach such a strong posterior.
If you genuinely have found a flaw in my argument, I’d like you to state it explicitly rather than just claim that I don’t understand causal models. In a hypothetical world in which I am wrong, I would very much like to know about it, as it would allow me to move on and work on something else.
Alright, updated on the chance that you actually know what you’re doing, although I still think it’s wildly unlikely that this Russian roulette example actually illustrates anything which is both useful and cannot be captured by DAGs.
The obvious causal model for the Russian roulette example is one with four nodes:
first node indicating whether roulette is played
second node, child of first, indicating whether roulette killed
third node, child of second, indicating whether some other cause killed (can only happen if the person survived roulette)
fourth node, death, child of second and third node
This makes sense physically, has a well-defined counterfactual for Norway, and produces the risk difference calculation from the post. What information is missing?
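Here is a quick simulation sketch of that model (the variable names are mine, and the country-level mortality numbers are the stylized ones from the post):

```python
import random

def simulate(p_other, play, n=1_000_000, seed=0):
    # Four-node model: play -> died by roulette -> died by other cause -> death
    rng = random.Random(seed)
    deaths = 0
    for _ in range(n):
        died_by_roulette = play and rng.random() < 1 / 6                   # node 2
        died_by_other = (not died_by_roulette) and rng.random() < p_other  # node 3
        deaths += died_by_roulette or died_by_other                        # node 4
    return deaths / n

for country, p_other in [("Russia", 0.01), ("Norway", 0.005)]:
    p1, p0 = simulate(p_other, play=True), simulate(p_other, play=False)
    print(f"{country}: risk ratio {p1 / p0:.1f}, "
          f"survival ratio {(1 - p1) / (1 - p0):.3f}")
# The survival ratio comes out ~5/6 in both countries; the risk ratio does not.
```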
Sure, the DAG isn’t going to contain all the information—you’ll usually have some information about the DAG, e.g. prior info about the DAG structure or about particular nodes within the DAG. But that’s still info about the DAG—throwing away the DAG itself is still a step backwards. The underlying structure of reality is still a DAG, it’s only our information about reality which will be non-DAG-shaped. DAGs show the causal structure, Bayesian probability handles whatever info we have about the DAGs.
Identifiability, sure. But latents still aren’t a problem for either extrapolation or model testing, as long as we’re using Bayesian inference. We don’t need identifiability.
I am not using Bayesian inference, and neither are Pearl and Bareinboim. Their graphical framework (“selection diagrams”) is very explicitly set up as a model for reasoning about whether the causal effect in the target population is identified in terms of observed data from the study population and observed data from the target population. Such identification may succeed or fail depending on latent variables and on the causal structure of the selection diagram.
I am confident that Pearl and Bareinboim would not disagree with me about the preceding paragraph. The point of disagreement is whether there are realistic ways to substantially reduce the set of variables that must be measured, by using background knowledge about the causal structure that cannot be represented on selection diagrams.
In my model of reality (and, I am sure, in most other people’s models of reality), the third node has a wide range of unobserved latent ancestors. If the goal is to make inferences about the effect of Russian roulette in Russia using data from Russia, your analytic objective will be to find a set of nodes that d-separates the first node from the fourth node. You do not need to condition on the latent causes of the third node to achieve this (those latent variables are not also causes of the first node; they cannot be, because the first node was randomized). The identification formula for the effect in Russia is therefore invariant to whether the latent causes of the third node are represented on the graph, and you do not have to show them. The DAG model then represents a huge equivalence class of causal models; you can be agnostic between causal models within this equivalence class because the inferences are invariant between them.
But if the goal is to make predictions about the effect in Norway using data from Russia, these latent variables suddenly become relevant. The goal is no longer to d-separate the fourth node from the first node, but to d-separate the fourth node from an indicator for whether a person lives in Russia or Norway. In the true data generating mechanism (i.e. in the reality that the model is trying to represent), there almost certainly are a substantial number of open paths between the indicator for whether a person lives in Norway or Russia and their risk of death. The only possible identification formula for the effect in Norway includes terms for distributions that are conditional on the latent variables. The effect in Norway is therefore not identified from the Russian data.
I agree that reality is generated by a structure that looks something like a directed acyclic graph. But that does not mean that all significant aspects of reality can be modeled using Pearl’s specific operationalization of causal DAGs/selection diagrams.
Any attempt to extrapolate from Russia to Norway is going to depend on a background belief that some aspect of the data generating structure is equal between the countries. In the case of Russian roulette, I argue that the natural choice of mathematical object to hang our claims to structural equality on, is the parameter that takes the value 5⁄6 in both countries.
In DAG terms, you can think of the data generating mechanism for node 4 as responding to a property of the path 1->2->4. In particular, this path forces the quantities Pr(Fourth node =0 | do(First node=1)) and Pr(Fourth node =0 | do(First node=0)) to be related by a factor of 5⁄6 in both countries. Reality still has a DAG structure, but you won’t find a way to encode the figure 5⁄6 in a causal model based only on selection diagrams. Without a way to encode a parameter that takes the value 5⁄6, you have to take a long detour where you collect a truckload of data and measure all the latent variables.
Ok, I think that’s the main issue here. As a criticism of Pearl and Bareinboim, I agree this is basically valid. That said, I’d still say that throwing out DAGs is a terrible way to handle the issue—Bayesian inference with DAGs is the right approach for this sort of problem.
The whole argument about identification problems then becomes irrelevant, as it should. Sometimes the true model of reality is not identifiable. This is not a problem with reality, and pretending some other model generates reality is not the way to fix it. The way to fix it is to use an inference procedure which does not assume identifiability.
The equality of this parameter is not sufficient to make the prediction we want to make—the counterfactual is still underspecified. The survival ratio calculation will only be correct if a particular DAG and counterfactual apply, and will be incorrect otherwise. By not using a DAG, it becomes unclear what assumptions we’re even making—it’s not at all clear what counterfactual we’re using, or whether there even is a well-defined counterfactual for which the calculation is correct.
I am not throwing out DAGs. I am just claiming that the particular aspect of reality that I think justifies extrapolation cannot be represented on a standard DAG. While I formalized my causal model for these aspects of reality without using graphs, I am confident that there exists a way to represent the same structural constraints in a DAG model. It is just that nobody has done it yet.
As for combining Bayesian inference and DAGs: This is one of those ideas that sounds great in principle, but where the details get very messy. I don’t have a good enough understanding of Bayesian statistics to make the argument in full, but I do know that very smart people have tried to combine it with causal models and concluded that it doesn’t work. Bayesianism therefore plays essentially no role in the causal modelling literature. If you believe you have an obvious solution to this, I recommend you write it up and submit to a journal, because you will get a very impactful publication out of it.
In a country where nobody plays Russian roulette, you have valid data on the distribution of outcomes under the scenario where nobody plays Russian roulette (due to simple consistency). In combination with knowledge about the survival ratio, this is sufficient to make a prediction for the distribution of outcomes in a counterfactual where everybody plays Russian roulette.
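Concretely:

$$\Pr(\text{survive} \mid do(\text{play}), \text{Norway}) = \frac{5}{6} \times \Pr(\text{survive} \mid do(\text{don't play}), \text{Norway}) = \frac{5}{6} \times 0.995 \approx 0.829,$$

where the second factor is just the observed Norwegian survival (by consistency, since nobody there plays), yielding the predicted death risk of 17.1%.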
Do you know of any references on the problems people have run into? I’ve used Bayesian inference on causal models in my own day-to-day work quite a bit without running into any fundamental issues (other than computational difficulty), and what I’ve read of people using them in neuroscience and ML generally seems to match that. So it sounds like knowledge has failed to diffuse—either the folks using this stuff haven’t heard about some class of problems with it, or the causal modelling folks are insufficiently steeped in Bayesian inference to handle the tricky bits.
I don’t have a great reference for this.
A place to start might be Judea Pearl’s essay “Why I’m only half-Bayesian” at https://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf . If you look at his Twitter account at @yudapearl, you will also see numerous tweets where he refers to Bayes Theorem as a “trivial identity” and where he talks about Bayesian statistics as “spraying priors on everything”. See for example https://twitter.com/yudapearl/status/1143118757126000640 and his discussions with Frank Harrell.
Another good read may be Robins, Hernan and Wasserman’s letter to the editor at Biometrics, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4667748/ . While that letter is not about graphical models, the propensity score/marginal structural model methodology it discusses is mathematically very closely related. The main argument in that letter (which was originally a blog post) has been discussed on Less Wrong before; I am trying to find the discussion, and it may be this link: https://www.lesswrong.com/posts/xdh5FPMYYGGX7PBKj/the-trouble-with-bayes-draft
From my perspective, as someone who is not well trained in Bayesian methods and does not pretend to understand the issue well, I can only observe that methodological work on causal models very rarely uses Bayesian statistics, that I myself do not see an obvious way to integrate the two, and that most of the smart people working on causal inference appear to be skeptical of such attempts.
Ok, after reading these, it’s sounding a lot more like the main problem is causal inference people not being very steeped in Bayesian inference. Robins, Hernan and Wasserman’s argument is based on a mistake that took all of ten minutes to spot: they show that a particular quantity is independent of the propensity score function if the true parameters of the model are known, then jump to the estimate of that quantity being independent of propensity—when in fact, the estimate is dependent on the propensity, because the estimates of the model parameters depend on propensity. Pearl’s argument is more abstract and IMO stronger, but is based on the idea that causal relationships are not statistically testable… when in fact, that’s basically the bread-and-butter use-case for Bayesian model comparison.
Some time in the next week I’ll write up a post with a few full examples (including the one from Robins, Hernan and Wasserman), and explain in a bit more detail.
(Side note: I suspect that the reason smart people have had so much trouble here is that the previous generation was mostly introduced to Bayesian statistics by Savage or Gelman; I expect someone who started with Jaynes would have a lot less trouble here, but his main textbook is relatively recent.)
I look forward to reading it. To be honest: Knowing these authors, I’d be surprised if you have found an error that breaks their argument.
We are now discussing questions that are so far outside of my expertise that I do not have the ability to independently evaluate the arguments, so I am unlikely to contribute further to this particular subthread (i.e. to the discussion about whether there exists an obvious and superior Bayesian solution to the problem I am trying to solve).
UPDATE Dec 2019: Based on Cinelli & Pearl’s response to the OP (and the associated paper), it does indeed look like all the relevant information can be integrated into a DAG model.
Over the course of this thread, I came to the impression that an unnecessary focus on identifiability was the main root problem with the OP. Now it looks like that was probably wrong. However, based on the Cinelli & Pearl paper, it does look like causal DAGs + Bayesian probability (or even non-Bayesian probability, for the example at hand) are all we need for this use-case.