Alright, updated on the chance that you actually know what you’re doing, although I still think it’s wildly unlikely that this Russian roulette example actually illustrates anything which is both useful and impossible to capture with DAGs.
The obvious causal model for the Russian roulette example is one with four nodes:
- first node, indicating whether roulette is played
- second node, child of the first, indicating whether roulette killed
- third node, child of the second, indicating whether some other cause killed (can only happen if the person survived roulette)
- fourth node, death, child of the second and third nodes
This makes sense physically, has a well-defined counterfactual for Norway, and produces the risk difference calculation from the post. What information is missing?
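To make this concrete, here’s a minimal generative sketch of the four nodes (a sketch only: the 1⁄6 chamber probability comes from the example itself, while the background mortality rate is a made-up placeholder):

```python
import random

def simulate_person(plays_roulette, p_other_death=0.01):
    """One draw from the four-node model."""
    node1 = plays_roulette                                    # whether roulette is played
    node2 = node1 and random.random() < 1 / 6                 # whether roulette killed
    node3 = (not node2) and random.random() < p_other_death   # other cause killed (only if roulette didn't)
    node4 = node2 or node3                                    # death
    return node4

def death_risk(plays, n=100_000):
    return sum(simulate_person(plays) for _ in range(n)) / n

random.seed(0)
print(death_risk(True) - death_risk(False))  # risk difference ≈ (1/6) * (1 - p_other_death) ≈ 0.165
```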
> There are several kinds of background information about the causal structure which are essential for identifiability and which cannot be encoded on standard DAGs. For example, monotonicity is necessary for instrumental variable identification.
Sure, the DAG isn’t going to contain all the information—you’ll usually have some information about the DAG, e.g. prior info about the DAG structure or about particular nodes within it. But that’s still info about the DAG—throwing away the DAG itself is still a step backwards. The underlying structure of reality is still a DAG; it’s only our information about reality which will be non-DAG-shaped. DAGs show the causal structure; Bayesian probability handles whatever info we have about the DAGs.
> For example, if there are backdoor paths that cannot be closed without conditioning on a latent variable, then the causal effect is not identified and there is no amount of computation that can get around this.
Identifiability, sure. But latents still aren’t a problem for either extrapolation or model testing, as long as we’re using Bayesian inference. We don’t need identifiability.
I am not using Bayesian inference, and neither are Pearl and Bareinboim. Their graphical framework (“selection diagrams”) is very explicitly set up as a model for reasoning about whether the causal effect in the target population is identified in terms of observed data from the study population and observed data from the target population. Such identification may succeed or fail depending on the latent variables and on the causal structure of the selection diagram.
I am confident that Pearl and Bareinboim would not disagree with me about the preceding paragraph. The point of disagreement is whether there are realistic ways to substantially reduce the set of variables that must be measured, by using background knowledge about the causal structure that cannot be represented on selection diagrams.
> The obvious causal model for the Russian roulette example is one with four nodes:
>
> - first node, indicating whether roulette is played
> - second node, child of the first, indicating whether roulette killed
> - third node, child of the second, indicating whether some other cause killed (can only happen if the person survived roulette)
> - fourth node, death, child of the second and third nodes
>
> This makes sense physically, has a well-defined counterfactual for Norway, and produces the risk difference calculation from the post. What information is missing?
In my model of reality (and, I am sure, in most other people’s models of reality), the third node has a wide range of unobserved latent ancestors. If the goal is to make inferences about the effect of Russian roulette in Russia using data from Russia, your analytic objective will be to find a set of nodes that d-separates the first node from the fourth node. You do not need to condition on the latent causes of the third node to achieve this (those latent variables are not also causes of the first node; they cannot be, because the first node was randomized). The identification formula for the effect in Russia is therefore invariant to whether the latent causes of the third node are represented on the graph, and you therefore do not have to show them. The DAG model then represents a huge equivalence class of causal models; you can be agnostic between causal models within this equivalence class because the inferences are invariant between them.
But if the goal is to make predictions about the effect in Norway using data from Russia, these latent variables suddenly become relevant. The goal is no longer to d-separate the fourth node from the first node, but to d-separate the fourth node from an indicator for whether a person lives in Russia or Norway. In the true data generating mechanism (i.e. in the reality that the model is trying to represent), there are almost certainly a substantial number of open paths between that indicator and a person’s risk of death. Any identification formula that transports the effect to Norway would therefore include terms for Russian distributions conditional on the latent variables, and the effect in Norway is not identified from the Russian data.
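For concreteness, here’s a sketch of both d-separation queries in networkx. The latent node U and the country indicator S are my illustrative additions, and the API name assumes a recent NetworkX (older releases call it nx.d_separated):

```python
import networkx as nx

# The four-node model, plus two illustrative additions: a latent U driving
# other-cause mortality, and a country indicator S (Russia vs Norway).
g = nx.DiGraph([
    ("play", "roulette_killed"),
    ("roulette_killed", "other_cause_killed"),
    ("roulette_killed", "death"),
    ("other_cause_killed", "death"),
    ("U", "other_cause_killed"),   # unobserved causes of background mortality
    ("S", "U"),                    # country determines the distribution of U
])

# Within-Russia question: are there backdoor paths from play to death?
# Backdoor check: delete the edges *out of* play, then test d-separation.
g_bd = g.copy()
g_bd.remove_edges_from(list(g.out_edges("play")))
print(nx.is_d_separator(g_bd, {"play"}, {"death"}, set()))  # True: no backdoor, effect identified

# Transport question: is death d-separated from the country indicator S?
print(nx.is_d_separator(g, {"S"}, {"death"}, set()))        # False: open path S -> U -> ... -> death
```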
> The underlying structure of reality is still a DAG; it’s only our information about reality which will be non-DAG-shaped. DAGs show the causal structure...
I agree that reality is generated by a structure that looks something like a directed acyclic graph. But that does not mean that all significant aspects of reality can be modeled using Pearl’s specific operationalization of causal DAGs/selection diagrams.
Any attempt to extrapolate from Russia to Norway is going to depend on a background belief that some aspect of the data generating structure is equal between the countries. In the case of Russian roulette, I argue that the natural choice of mathematical object to hang our claims of structural equality on is the parameter that takes the value 5⁄6 in both countries.
In DAG terms, you can think of the data generating mechanism for node 4 as responding to a property of the path 1 -> 2 -> 4. In particular, this path forces the quantities Pr(node 4 = 0 | do(node 1 = 1)) and Pr(node 4 = 0 | do(node 1 = 0)) to be related by a factor of 5⁄6 in both countries. Reality still has a DAG structure, but you won’t find a way to encode the figure 5⁄6 in a causal model based only on selection diagrams. Without a way to encode a parameter that takes the value 5⁄6, you have to take a long detour where you collect a truckload of data and measure all the latent variables.
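In symbols (the notation is mine: D for the fourth node, R for the first, C for country), the constraint I am pointing at is the cross-country equality

$$\Pr(D = 0 \mid \mathrm{do}(R = 1), C = c) \;=\; \tfrac{5}{6} \cdot \Pr(D = 0 \mid \mathrm{do}(R = 0), C = c), \qquad c \in \{\text{Russia}, \text{Norway}\}.$$

It is this shared multiplicative factor, rather than any single probability, that a selection diagram has no way to annotate.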
> I am not using Bayesian inference, and neither are Pearl and Bareinboim.
Ok, I think that’s the main issue here. As a criticism of Pearl and Bareinboim, I agree this is basically valid. That said, I’d still say that throwing out DAGs is a terrible way to handle the issue—Bayesian inference with DAGs is the right approach for this sort of problem.
The whole argument about identification problems then becomes irrelevant, as it should. Sometimes the true model of reality is not identifiable. This is not a problem with reality, and pretending some other model generates reality is not the way to fix it. The way to fix it is to use an inference procedure which does not assume identifiability.
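To illustrate what inference without identifiability looks like, here’s a toy example (the model and numbers are mine, not anyone’s proposal): two parameters enter the likelihood only through their sum, so neither is identified separately, yet the posterior behaves perfectly well:

```python
import numpy as np

# Toy non-identified model: y_i ~ Normal(a + b, 1). Only the sum a + b is
# identified; a and b separately are not. The posterior is still proper:
# it concentrates on a + b and stays prior-shaped in the other direction.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0 + 2.0, scale=1.0, size=50)  # true a = 1, b = 2

grid = np.linspace(-6.0, 9.0, 300)
a, b = np.meshgrid(grid, grid, indexing="ij")
log_post = (
    -(a**2 + b**2) / (2 * 2.0**2)                        # Normal(0, 2) priors on a, b
    - 0.5 * ((y[:, None, None] - (a + b)) ** 2).sum(0)   # Gaussian likelihood
)
post = np.exp(log_post - log_post.max())
post /= post.sum()

def mean_sd(q):
    m = (post * q).sum()
    return m, np.sqrt((post * (q - m) ** 2).sum())

print("a + b: mean %.2f, sd %.2f" % mean_sd(a + b))  # sharp, near the true 3
print("a alone: mean %.2f, sd %.2f" % mean_sd(a))    # wide: not identified, but well-defined
```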
> Any attempt to extrapolate from Russia to Norway is going to depend on a background belief that some aspect of the data generating structure is equal between the countries. In the case of Russian roulette, I argue that the natural choice of mathematical object to hang our claims of structural equality on is the parameter that takes the value 5⁄6 in both countries.
The equality of this parameter is not sufficient to make the prediction we want to make—the counterfactual is still underspecified. The survival ratio calculation will only be correct if a particular DAG and counterfactual apply, and will be incorrect otherwise. By not using a DAG, it becomes unclear what assumptions we’re even making—it’s not at all clear what counterfactual we’re using, or whether there even is a well-defined counterfactual for which the calculation is correct.
> Ok, I think that’s the main issue here. As a criticism of Pearl and Bareinboim, I agree this is basically valid. That said, I’d still say that throwing out DAGs is a terrible way to handle the issue—Bayesian inference with DAGs is the right approach for this sort of problem.
I am not throwing out DAGs. I am just claiming that the particular aspect of reality that I think justifies extrapolation cannot be represented on a standard DAG. While I formalized my causal model for these aspects of reality without using graphs, I am confident that there exists a way to represent the same structural constraints in a DAG model. It is just that nobody has done it yet.
As for combining Bayesian inference and DAGs: this is one of those ideas that sounds great in principle, but where the details get very messy. I don’t have a good enough understanding of Bayesian statistics to make the argument in full, but I do know that very smart people have tried to combine it with causal models and concluded that it doesn’t work. Bayesianism therefore plays essentially no role in the causal modelling literature. If you believe you have an obvious solution to this, I recommend you write it up and submit it to a journal, because you will get a very impactful publication out of it.
> The equality of this parameter is not sufficient to make the prediction we want to make—the counterfactual is still underspecified. The survival ratio calculation will only be correct if a particular DAG and counterfactual apply, and will be incorrect otherwise.
In a country where nobody plays Russian roulette, you have valid data on the distribution of outcomes under the scenario where nobody plays Russian roulette (due to simple consistency). In combination with knowledge about the survival ratio, this is sufficient to make a prediction for the distribution of outcomes in a counterfactual where everybody plays Russian roulette.
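Numerically, with a made-up baseline survival figure just to show the arithmetic:

```python
# Survival under "nobody plays" is observed directly (consistency);
# the 0.999 baseline is a made-up placeholder figure.
p_survive_nobody_plays = 0.999
survival_ratio = 5 / 6                      # assumed equal across countries
p_survive_everybody_plays = survival_ratio * p_survive_nobody_plays
print(p_survive_everybody_plays)            # ≈ 0.8325
```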
> I don’t have a good enough understanding of Bayesian statistics to make the argument in full, but I do know that very smart people have tried to combine it with causal models and concluded that it doesn’t work.
Do you know of any references on the problems people have run into? I’ve used Bayesian inference on causal models in my own day-to-day work quite a bit without running into any fundamental issues (other than computational difficulty), and what I’ve read of people using them in neuroscience and ML generally seems to match that. So it sounds like knowledge has failed to diffuse—either the folks using this stuff haven’t heard about some class of problems with it, or the causal modelling folks are insufficiently steeped in Bayesian inference to handle the tricky bits.
I don’t have a great reference for this.

A place to start might be Judea Pearl’s essay “Why I’m only half-Bayesian” at https://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf . If you look at his Twitter account at @yudapearl, you will also see numerous tweets where he refers to Bayes’ theorem as a “trivial identity” and where he talks about Bayesian statistics as “spraying priors on everything”. See for example https://twitter.com/yudapearl/status/1143118757126000640 and his discussions with Frank Harrell.

Another good read may be Robins, Hernan and Wasserman’s letter to the editor at Biometrics, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4667748/ . While that letter is not about graphical models, propensity scores and marginal structural models are mathematically very closely related. The main argument in that letter (which was originally a blog post) has been discussed on Less Wrong before; I am trying to find the discussion, it may be this link: https://www.lesswrong.com/posts/xdh5FPMYYGGX7PBKj/the-trouble-with-bayes-draft

From my perspective, as someone who is not well trained in Bayesian methods and does not pretend to understand the issue well, I just observe that methodological work on causal models very rarely uses Bayesian statistics, that I myself do not see an obvious way to integrate it, and that most of the smart people working on causal inference appear to be skeptical of such attempts.
Ok, after reading these, it’s sounding a lot more like the main problem is causal inference people not being very steeped in Bayesian inference. Robins, Hernan and Wasserman’s argument is based on a mistake that took all of ten minutes to spot: they show that a particular quantity is independent of the propensity score function if the true parameters of the model are known, then jump to the estimate of that quantity being independent of the propensity—when in fact the estimate does depend on the propensity, because the estimates of the model parameters depend on it. Pearl’s argument is more abstract and IMO stronger, but it rests on the idea that causal relationships are not statistically testable… when in fact that’s basically the bread-and-butter use-case for Bayesian model comparison.
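As a sketch of that last claim (the setup, priors, and numbers here are all mine): give Bayesian model comparison data from two regimes, observational plus a do(X=1) intervention, and it cleanly separates “X causes Y” from “Y causes X”, because the two hypotheses pool the counts differently:

```python
import numpy as np
from math import lgamma

def log_marg(k, n):
    """Log marginal likelihood of k successes in n Bernoulli trials,
    integrating the success probability against a Beta(1, 1) prior."""
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

rng = np.random.default_rng(0)
n_obs = n_int = 200

# True mechanism: X -> Y, with P(Y=1 | X) equal to 0.8 or 0.2.
x_obs = rng.random(n_obs) < 0.5
y_obs = rng.random(n_obs) < np.where(x_obs, 0.8, 0.2)
y_int = rng.random(n_int) < 0.8    # interventional regime: do(X = 1)

# Hypothesis A (X -> Y): under do(X=1), Y follows P(Y | X=1), so the
# interventional outcomes pool with the observational X=1 rows.
score_a = (
    log_marg(x_obs.sum(), n_obs)
    + log_marg(y_obs[~x_obs].sum(), (~x_obs).sum())
    + log_marg(y_obs[x_obs].sum() + y_int.sum(), x_obs.sum() + n_int)
)

# Hypothesis B (Y -> X): do(X=1) cuts the edge into X, leaving Y at its
# marginal, so the interventional outcomes pool with *all* observational Y.
score_b = (
    log_marg(y_obs.sum() + y_int.sum(), n_obs + n_int)
    + log_marg(x_obs[~y_obs].sum(), (~y_obs).sum())
    + log_marg(x_obs[y_obs].sum(), y_obs.sum())
)

print("log Bayes factor for X->Y over Y->X:", score_a - score_b)  # strongly positive
```

On the observational data alone the two structures are nearly likelihood-equivalent; it’s the second regime that breaks the symmetry.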
Some time in the next week I’ll write up a post with a few full examples (including the one from Robins, Hernan and Wasserman), and explain in a bit more detail.
(Side note: I suspect that the reason smart people have had so much trouble here is that the previous generation was mostly introduced to Bayesian statistics by Savage or Gelman; I expect someone who started with Jaynes would have a lot less trouble here, but his main textbook is relatively recent.)
> Some time in the next week I’ll write up a post with a few full examples (including the one from Robins, Hernan and Wasserman), and explain in a bit more detail.
I look forward to reading it. To be honest: Knowing these authors, I’d be surprised if you have found an error that breaks their argument.
We are now discussing questions that are so far outside of my expertise that I do not have the ability to independently evaluate the arguments, so I am unlikely to contribute further to this particular subthread (i.e. to the discussion about whether there exists an obvious and superior Bayesian solution to the problem I am trying to solve).