But does the number of causal relationships go up just as fast? I don’t think so (although at the moment I can’t prove it).
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A “structurally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “structurally dependent” with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G.
A marginal d-connecting path between two nodes is a path with no two consecutive edges of the form → ← (that is, no colliders on the path). In other words, all directed paths are marginal d-connecting, but the converse isn’t true.
The justification for this definition is that if A “structurally causes” B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in “most” distributions that arise from causal structures consistent with G. Similarly, if A and B are “structurally dependent” in a DAG G, then in “most” distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say ‘correlations are there’).
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions “unfaithful” to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (the subset of all distributions consistent with G that is unfaithful to G has measure zero), but Nature does not pick randomly, so unfaithful distributions are a worry. They may arise for systematic reasons (maybe the equilibrium of a feedback process in biology?).
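A minimal sketch of an unfaithful distribution, assuming a linear-Gaussian model with hand-picked coefficients (the coefficients, sample size, and function name are my own illustrative choices, not from the discussion). The DAG is A → B → C plus A → C, and the two routes from A to C are tuned to cancel exactly, so the arrow A → C is present while A and C are independent.

```python
import numpy as np

def simulate_cancelling_paths(n_samples=100_000, seed=0):
    """Sample from A -> B -> C, A -> C with coefficients chosen to cancel."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_samples)
    b = 1.0 * a + rng.normal(size=n_samples)              # A -> B, coefficient +1
    c = 1.0 * b - 1.0 * a + rng.normal(size=n_samples)    # B -> C (+1), A -> C (-1)
    return a, b, c

a, b, c = simulate_cancelling_paths()
corr_ab = np.corrcoef(a, b)[0, 1]  # clearly nonzero: the edge A -> B shows up
corr_ac = np.corrcoef(a, c)[0, 1]  # near zero: the two A-to-C routes cancel
print(f"corr(A, B) = {corr_ab:.3f}, corr(A, C) = {corr_ac:.3f}")
```

Any small perturbation of the coefficients breaks the cancellation, which is the sense in which such parameter settings form a measure-zero set.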
If you accept the above definitions, then clearly for a DAG with n vertices, the number of pairwise structural dependence relationships is an upper bound on the number of pairwise structural causal relationships, since every directed path is also a marginal d-connecting path. I am not aware of anyone having worked out the exact combinatorics here, but it’s clear there are many, many more paths for structural dependence than paths for structural causality.
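The two definitions can be counted directly on a small graph. This is a sketch under one stated assumption: a collider-free path between two nodes exists exactly when they share a common ancestor, counting each node as its own ancestor. The function names and the edge-list representation are hypothetical.

```python
from itertools import combinations

def ancestors_incl_self(n_nodes, edges):
    """Map each node to the set of its ancestors, including the node itself."""
    parents = {v: set() for v in range(n_nodes)}
    for u, v in edges:          # each edge is (parent, child)
        parents[v].add(u)
    anc = {}

    def visit(v):
        if v not in anc:
            anc[v] = {v}
            for p in parents[v]:
                anc[v] |= visit(p)
        return anc[v]

    for v in range(n_nodes):
        visit(v)
    return anc

def count_pairs(n_nodes, edges):
    """Return (#pairs joined by a directed path, #pairs joined by a collider-free path)."""
    anc = ancestors_incl_self(n_nodes, edges)
    causal = dependent = 0
    for a, b in combinations(range(n_nodes), 2):
        if a in anc[b] or b in anc[a]:   # directed path in one direction
            causal += 1
        if anc[a] & anc[b]:              # shared ancestor => no-collider path
            dependent += 1
    return causal, dependent

# Fork 0 -> 1, 0 -> 2, 0 -> 3: three causal pairs, but all six pairs are dependent.
print(count_pairs(4, [(0, 1), (0, 2), (0, 3)]))
```

The first count can never exceed the second, because a directed path from a to b makes a a common ancestor of both.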
But what you actually want is not a DAG with n vertices, but another type of graph with n vertices. The “Universe DAG” has a lot of vertices, but what we actually observe is a very small subset of these vertices, and we marginalize over the rest. The trouble is, if you start with a distribution that is consistent with a DAG, and you marginalize over some things, you may end up with a distribution that isn’t well represented by a DAG. Or “DAG models aren’t closed under marginalization.”
That is, if our DAG is A → B ← H → C ← D, and we marginalize over H because we do not observe H, what we get is a distribution where no DAG can properly represent all conditional independences. We need another kind of graph.
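To see the independence pattern that causes the trouble, here is a sketch that assumes a linear-Gaussian version of A → B ← H → C ← D with unit coefficients and unit noise variances (both are my assumptions, chosen only for convenience). Marginalizing over H then just means never looking at H's row and column of the covariance matrix.

```python
import numpy as np

names = ["A", "B", "H", "C", "D"]
idx = {name: i for i, name in enumerate(names)}

# W[i, j] is the (assumed, unit) coefficient on the edge i -> j.
W = np.zeros((5, 5))
for parent, child in [("A", "B"), ("H", "B"), ("H", "C"), ("D", "C")]:
    W[idx[parent], idx[child]] = 1.0

# With unit-variance noise e, x = W^T x + e gives Cov(x) = M M^T, M = (I - W^T)^{-1}.
M = np.linalg.inv(np.eye(5) - W.T)
sigma = M @ M.T

def cov(u, v, given=()):
    """Covariance of u and v, optionally conditional on the variables in `given`."""
    i, j = idx[u], idx[v]
    if not given:
        return sigma[i, j]
    g = [idx[name] for name in given]
    s_gg = sigma[np.ix_(g, g)]
    return sigma[i, j] - sigma[i, g] @ np.linalg.solve(s_gg, sigma[g, j])

for u, v in [("A", "C"), ("A", "D"), ("B", "D"), ("B", "C")]:
    print(f"cov({u}, {v}) = {cov(u, v):.3f}")
print(f"cov(A, C | B) = {cov('A', 'C', given=('B',)):.3f}")
print(f"cov(B, D | C) = {cov('B', 'D', given=('C',)):.3f}")
```

Over the observed variables, B and C stay dependent, so a DAG would need an edge between them. Keeping A marginally independent of C would force that edge to point into B, while keeping D marginally independent of B would force it to point into C, and it cannot do both.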
In fact, people have come up with a mixed graph (containing → arrows and <-> arrows) to represent margins of DAGs. Here → means the same as in a causal DAG, but <-> means “there is some sort of common cause/confounder that we don’t want to explicitly write down.” Note: <-> is not a correlative arrow, it is still encoding something causal (the presence of a hidden common cause or causes). I am being loose here—in fact it is the absence of arrows that means things, not the presence.
I do a lot of work on these kinds of graphs, because these graphs are the sensible representation of the data we typically get: data drawn from a marginal of a joint distribution consistent with a big unknown DAG.
But the combinatorics work out the same in these graphs: the number of marginal d-connecting paths is much bigger than the number of directed paths. This is probably the source of your intuition. Of course what often happens is you do have a (weak) causal link between A and B, but a much stronger non-causal link between A and B through an unobserved common parent. So the causal link is hard to find without “tricks.”
The dependence that arises from a conditioned common effect (simplest case A → [C] ← B) that people have brought up does arise in practice, usually if your samples aren’t independent. Typical case: phone surveys are only administered to people with phones. Or case-control studies for rare diseases need to gather one arm from people who are actually already sick (this is called “outcome-dependent sampling”).
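A small simulation of the A → [C] ← B case, as a sketch only: A and B are independent standard normals, C is their sum plus noise, and "being sampled" is modelled as C exceeding a threshold. The distributions, the threshold, and the function name are all assumptions made for illustration.

```python
import numpy as np

def collider_selection_demo(n_samples=100_000, threshold=1.0, seed=0):
    """Compare corr(A, B) in the full population and among units selected on C."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_samples)
    b = rng.normal(size=n_samples)                 # independent of A by construction
    c = a + b + rng.normal(size=n_samples)         # common effect A -> C <- B
    selected = c > threshold                       # only these units enter the sample
    corr_all = np.corrcoef(a, b)[0, 1]
    corr_selected = np.corrcoef(a[selected], b[selected])[0, 1]
    return corr_all, corr_selected

corr_all, corr_selected = collider_selection_demo()
print(f"full population: {corr_all:.3f}, selected sample: {corr_selected:.3f}")
```

In the selected sample A and B become negatively correlated even though neither influences the other.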
Sterner measures might be needed: could we draw causal nets with not just arrows showing influence but also another kind of arrow showing correlations?
Phil Dawid works with DAG models that are partially causal and partially statistical. But I think we should first be very very clear on exactly what a statistical DAG model is, and what a causal DAG model is, and how they are different. Then we could start combining without confusion!
If you have a prior over DAG/mixed graph structures because you are Bayesian, you can obviously have beliefs about a causal relationship between A and B vs a dependent relationship between A and B, and update your beliefs based on evidence, etc. Bayesian reasoning about causality does involve saying at some point “I have an assumption that is letting me draw causal conclusions from a fact I observed about a joint distribution,” which is not a trivial step (this is not unique to Bayesians, of course; anyone who wants to do causality from observational data has to deal with this).
what’s the psychology of this?
Pearl has this hypothesis that a lot of probabilistic fallacies/paradoxes/biases are due to the fact that causal and not probabilistic relationships are what our brain natively thinks about. So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s type reversal under interventions: http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf).
This hypothesis would claim that people who haven’t looked into the math just interpret statements about conditional probabilities as about “interventional probabilities” (or whatever their intuitive analogue of a causal thing is).
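The gap between the conditional and the interventional reading can be shown with a few lines of arithmetic. The counts below are hypothetical, and the adjustment step assumes that severity is a pre-treatment confounder that satisfies the backdoor criterion, so that P(recovery | do(x)) = Σ_z P(recovery | x, z) P(z).

```python
# Hypothetical counts: (treatment, severity) -> (recovered, total).
counts = {
    ("a", "mild"): (81, 87),
    ("a", "severe"): (192, 263),
    ("b", "mild"): (234, 270),
    ("b", "severe"): (55, 80),
}
strata = ["mild", "severe"]
treatments = ["a", "b"]

def conditional_rate(x):
    """P(recovery | treatment = x), pooling over severity."""
    recovered = sum(counts[(x, z)][0] for z in strata)
    total = sum(counts[(x, z)][1] for z in strata)
    return recovered / total

def adjusted_rate(x):
    """Backdoor adjustment: sum_z P(recovery | x, z) * P(z)."""
    n_all = sum(total for _, total in counts.values())
    rate = 0.0
    for z in strata:
        p_z = sum(counts[(t, z)][1] for t in treatments) / n_all
        recovered, total = counts[(x, z)]
        rate += (recovered / total) * p_z
    return rate

for x in treatments:
    print(x, round(conditional_rate(x), 3), round(adjusted_rate(x), 3))
```

With these counts, treatment a looks worse in the pooled conditional rates but is better in every stratum and better after adjustment, which is the reversal that the interventional reading does not suffer from.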
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect).
You probably did not intend to imply that this was an arbitrary choice, but it would still be interesting to hear your thoughts on it. It seems to me that the choice to represent independences by missing arrows was necessary. If they had instead chosen to represent dependences by present arrows, I don’t see how the graphs would be useful for causal inference.
If missing arrows represent independences and the backdoor criterion holds, this is interpreted as “for all distributions that are consistent with the model, there is no confounding”. This is clearly very useful. If arrows represented dependences, it would instead be interpreted as “For at least one distribution that is consistent with the DAG model, there is no confounding”. This is not useful to the investigator.
Since unconfoundedness is an independence-relation, it is not clear to me how graphs that encode dependence-relations would be useful. Can you think of a graphical criterion for unconfoundedness in dependence graphs? Or would dependence graphs be useful for a different purpose?
Hi, thanks for this. I agree that this choice was not arbitrary at all!
There are a few related reasons why it was made.
(a) Pearl wisely noted that it is independences that we exploit for things like propagating beliefs around a sparse graph in polynomial time. When he was still arguing for the use of probability in AI, people in AI were still not fully on board, because they thought that to probabilistically reason about n binary variables we need a 2^n table for the joint, which is a non-starter (of course statisticians were on board w/ probability for hundreds of years even though they didn’t have computers—their solution was to use clever parametric models. In some sense Bayesian networks are just another kind of clever parametric model that finally penetrated the AI culture in the late 80s).
(b) We can define statistical (causal) models by either independences or dependences, but there is a lack of symmetry here that the symmetry of the “presence or absence of edges in a graph” masks. An independence is about a small part of the parameter space. That is, a model defined by an independence will correspond to a manifold of smaller dimension generally that sits in a space corresponding to a saturated model (no constraints). A model defined by dependences will just be that same space with a “small part” missing. Lowering dimension in a model is really nice in stats for a number of reasons.
(c) While conceivably we might be interested in a presence of a causal effect more than an absence of a causal effect, you are absolutely right that generally assumptions that allow us to equate a causal effect with some functional of observed data take the form of equality constraints (e.g. “independences in something.”) So it is much more useful to represent that even if we care about the presence of an effect at the end of the day. We can just see how far from null the final effect number is—we don’t need a graphical representation. However a graphical representation for assumptions we are exploiting to get the effect as a functional of observed data is very handy—this is what eventually led Jin Tian to his awesome identification algorithm on graphs.
(d) There is an interesting logical structure to conditional independence, e.g. Phil Dawid’s graphoid axioms. There is something like that for dependences (Armstrong’s axioms for functional dependence in db theory?) but the structure isn’t as rich.
edit: there are actually only two semi-graphoid axioms: one for symmetry and one for the chain rule.
edit^2: graphoids are not complete (because conditional independence is actually kind of a nasty relation). But at least it’s a ternary relation. There are far worse dragons in the cave of “equality constraints.”
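Point (b) above can be made concrete with the smallest possible case, two binary variables. This is a worked illustration of my own, not something from the original comment; p_{ij} denotes P(X = i, Y = j), and p_{i+}, p_{+j} are the margins.

```latex
\begin{align*}
\text{Saturated model:}\quad
  & p_{00}, p_{01}, p_{10}, p_{11} \ge 0,\ \textstyle\sum_{i,j} p_{ij} = 1,
  && \text{dimension } 3,\\
\text{Independence model:}\quad
  & p_{00}\,p_{11} = p_{01}\,p_{10}
    \iff p_{ij} = p_{i+}\,p_{+j},
  && \text{dimension } 2,\\
\text{Dependence model:}\quad
  & p_{00}\,p_{11} \neq p_{01}\,p_{10},
  && \text{dimension } 3.
\end{align*}
```

The independence model is a two-dimensional surface inside the three-dimensional simplex, while the dependence model is the whole simplex with that surface removed.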
I tried to read that, but I think I didn’t understand too much of it or its connection to this topic. I’ll save that whole festschrift for later, there were some interesting titles in the table of contents.
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show.
I agree I did sort of conflate causal networks and Bayesian networks in general… I didn’t realize there was no clean way of having both at the same time.
It might help if I describe a concrete way to test my claim using just causal networks: generate a randomly connected causal network with x nodes and y arrows, where each arrow has some random noise in it; count how many pairs of nodes are in a causal relationship; now, 1000 times, initialize the root nodes to random values and generate a possible state of the network, storing the values for each node; count how many pairwise correlations there are between all the nodes using the 1000 samples (using an appropriate significance test & alpha if one wants); divide # of causal relationships by # of correlations, store; return to the beginning and resume with x+1 nodes and y+1 arrows… As one graphs each x against its respective estimated fraction, does the fraction head toward 0 as x increases? My thesis is it does.
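The procedure above could be sketched roughly as follows. Several choices here are assumptions that the description leaves open: the network is linear-Gaussian with independent noise at every node, edge weights are drawn with magnitude between 0.5 and 1.5, "correlated" means passing an uncorrected Fisher z-test at roughly alpha = 0.05, and the number of arrows is set equal to the number of nodes. All function names are hypothetical.

```python
import numpy as np

def random_dag(n_nodes, n_edges, rng):
    """Random DAG as a weight matrix; an edge i -> j is only allowed when i < j."""
    pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    chosen = rng.choice(len(pairs), size=n_edges, replace=False)
    weights = np.zeros((n_nodes, n_nodes))
    for k in chosen:
        i, j = pairs[k]
        weights[i, j] = rng.uniform(0.5, 1.5) * rng.choice([-1.0, 1.0])
    return weights

def count_causal_pairs(weights):
    """Number of node pairs joined by a directed path (transitive closure)."""
    reach = weights != 0
    for k in range(len(weights)):
        reach = reach | (reach[:, [k]] & reach[[k], :])
    return int(reach.sum())

def count_correlated_pairs(weights, n_samples, rng, z_crit=1.96):
    """Number of node pairs whose sample correlation passes a Fisher z-test."""
    n = len(weights)
    noise = rng.normal(size=(n_samples, n))
    # Each row solves x = x W + noise, i.e. x = noise (I - W)^{-1}.
    data = noise @ np.linalg.inv(np.eye(n) - weights)
    corr = np.corrcoef(data, rowvar=False)
    upper = np.triu_indices(n, k=1)
    z = np.arctanh(corr[upper]) * np.sqrt(n_samples - 3)
    return int((np.abs(z) > z_crit).sum())

def causal_to_correlation_ratio(n_nodes, n_edges, n_samples=1000, seed=0):
    """One run of the experiment: #causal pairs divided by #correlated pairs."""
    rng = np.random.default_rng(seed)
    weights = random_dag(n_nodes, n_edges, rng)
    causal = count_causal_pairs(weights)
    correlated = count_correlated_pairs(weights, n_samples, rng)
    return causal / correlated if correlated else float("nan")

for x in range(5, 30, 5):
    print(x, causal_to_correlation_ratio(x, x))
```

A single run per x is noisy, and the z-test will flag some independent pairs by chance, so averaging over many seeds per x would be needed before reading anything into the trend.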
So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s type reversal under interventions: http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf).
Interesting, and it reminds me of what happens in physics classes: people learn how to memorize teachers’ passwords, but go on thinking in folk-Aristotelian physics fashion, as revealed by simple multiple-choice tests designed to home in on the appealing folk-physics misconceptions vs ‘unnatural’ Newtonian mechanics. That’s a plausible explanation, but I wonder if anyone has established more directly that people really do reason causally even when they know they’re not supposed to? Offhand, it doesn’t really sound like any bias I can think of. It shouldn’t be too hard to develop such a test for teachers of causality material: just take common student misconceptions or dead ends and refine them into a multiple-choice test. I’d bet stats 101 courses have as many problems as intro physics courses.
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A “structurally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “structurally dependent” with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G. A marginal d-connecting path between two nodes is a path with no two consecutive edges of the form → ← (that is, no colliders on the path). In other words all directed paths are marginal d-connecting but the opposite isn’t true.
That seems to make sense to me.
The justification for this definition is that if A “structurally causes” B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in “most” distributions that arise from causal structures consistent with G. Similarly, if A and B are “structurally dependent” in a DAG G, then in “most” distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say ‘correlations are there’).
I’m not sure about marginal dependence.
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions “unfaithful” to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (the subset of all distributions consistent with G that is unfaithful to G has measure zero), but Nature does not pick randomly, so unfaithful distributions are a worry.
I’m afraid I don’t understand you here. If we draw an arrow from A to B, either as a causal or Bayesian net, because we’ve observed correlation or causation (maybe we actually randomized A for once), how can there not be a relationship in any underlying reality and there actually be an ‘independence’ and the graph be ‘unfaithful’?
Anyway, it seems that either way, there might be something to this idea. I’ll keep it in mind for the future.
Hi gwern, it’s awesome you are grappling with these issues. Here are some rambling responses.
You might enjoy Sander Greenland’s essay here:
http://bayes.cs.ucla.edu/TRIBUTE/festschrift-complete.pdf
Sander can be pretty bleak!
Good comment—upvoted. Just a minor question:
Thanks for reading.