Hi, gwern it’s awesome you are grappling with these issues. Here are some rambling responses.

You might enjoy Sander Greenland’s essay here:

http://bayes.cs.ucla.edu/TRIBUTE/festschrift-complete.pdf

Sander can be pretty bleak!

But does the number of causal relationships go up just as fast? I don’t think so (although at the moment I can’t prove it).
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A “structurally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “structurally dependent” with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G.
A marginal d-connecting path between two nodes is a path with no consecutive edges of the form * → * ← * (that is, no colliders on the path). In other words, all directed paths are marginal d-connecting but the opposite isn’t true.
The justification for this definition is that if A “structurally causes” B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in “most” distributions that arise from causal structures consistent with G. Similarly, if A and B are “structurally dependent” in a DAG G, then in “most” distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say ‘correlations are there’).
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions “unfaithful” to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (the subset of distributions consistent with G that are unfaithful to G has measure zero), but Nature does not pick randomly, so unfaithful distributions are a worry. They may arise for systematic reasons (maybe the equilibrium of a feedback process in biology?)
If you accept above definition, then clearly for a DAG with n vertices, the number of pairwise structural dependence relationships is an upper bound on the number of pairwise structural causal relationships. I am not aware of anyone having worked out the exact combinatorics here, but it’s clear there are many many more paths for structural dependence than paths for structural causality.
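A quick way to see the gap is to count both relations in small random DAGs. Below is a sketch of that formalization in Python (networkx assumed); it uses the fact that a collider-free path between two nodes has a single topmost node, so "structurally dependent" is the same as "share an ancestor (counting a node as its own ancestor)".

    import itertools, random
    import networkx as nx

    def random_dag(n, p=0.15, seed=None):
        # random DAG: take nodes 0..n-1 in a fixed order and add each forward edge with probability p
        rng = random.Random(seed)
        g = nx.DiGraph()
        g.add_nodes_from(range(n))
        for i, j in itertools.combinations(range(n), 2):
            if rng.random() < p:
                g.add_edge(i, j)
        return g

    def count_relations(g):
        # include the node itself so "common ancestor" also covers direct ancestry
        anc = {v: nx.ancestors(g, v) | {v} for v in g}
        causal = dependent = 0
        for a, b in itertools.combinations(g.nodes, 2):
            if b in anc[a] or a in anc[b]:        # directed path one way or the other
                causal += 1
            if anc[a] & anc[b]:                   # common ancestor => collider-free path
                dependent += 1
        return causal, dependent

    g = random_dag(30, p=0.1, seed=0)
    c, d = count_relations(g)
    print(c, d, c / d if d else float("nan"))

In quick runs of this kind the causal count is a fraction of the dependence count, with the exact ratio depending on the edge density.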
But what you actually want is not a DAG with n vertices, but another type of graph with n vertices. The “Universe DAG” has a lot of vertices, but what we actually observe is a very small subset of these vertices, and we marginalize over the rest. The trouble is, if you start with a distribution that is consistent with a DAG, and you marginalize over some things, you may end up with a distribution that isn’t well represented by a DAG. Or “DAG models aren’t closed under marginalization.”
That is, if our DAG is A → B ← H → C ← D, and we marginalize over H because we do not observe H, what we get is a distribution where no DAG can properly represent all conditional independences. We need another kind of graph.
In fact, people have come up with a mixed graph (containing → arrows and ↔ arrows) to represent margins of DAGs. Here → means the same as in a causal DAG, but ↔ means “there is some sort of common cause/confounder that we don’t want to explicitly write down.” Note: ↔ is not a correlative arrow, it is still encoding something causal (the presence of a hidden common cause or causes). I am being loose here—in fact it is the absence of arrows that means things, not the presence.
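A toy version of that construction (the "latent projection") for the example DAG above, with networkx assumed. The bidirected rule here is simplified to "the two observed nodes share a latent ancestor", which is enough for this example; the general construction also has to track latent-only connecting paths.

    import itertools
    import networkx as nx

    dag = nx.DiGraph([("A", "B"), ("H", "B"), ("H", "C"), ("D", "C")])
    latent = {"H"}
    observed = sorted(set(dag) - latent)

    directed, bidirected = set(), set()
    for u, v in itertools.permutations(observed, 2):
        # keep u -> v if some directed path from u to v has only latent intermediates
        for path in nx.all_simple_paths(dag, u, v):
            if set(path[1:-1]) <= latent:
                directed.add((u, v))
                break
    for u, v in itertools.combinations(observed, 2):
        # u <-> v if some latent node is an ancestor of both (simplified rule)
        if any(u in nx.descendants(dag, h) and v in nx.descendants(dag, h) for h in latent):
            bidirected.add((u, v))

    print(sorted(directed))    # [('A', 'B'), ('D', 'C')]
    print(sorted(bidirected))  # [('B', 'C')]

So marginalizing out H leaves A → B, D → C, and B ↔ C, which no DAG over just {A, B, C, D} can encode with the same independence structure.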
I do a lot of work on these kinds of graphs, because these graphs are the sensible representation of data we typically get—drawn from a marginal of a joint distribution consistent with a big unknown DAG.
But the combinatorics work out the same in these graphs—the number of marginal d-connected paths is much bigger than the number of directed paths. This is probably the source of your intuition. Of course what often happens is you do have a (weak) causal link between A and B, but a much stronger non-causal link between A and B through an unobserved common parent. So the causal link is hard to find without “tricks.”
The dependence that arises from a conditioned common effect (simplest case A → [C] ← B) that people have brought up does arise in practice, usually if your samples aren’t independent. Typical case: phone surveys are only administered to people with phones. Or case-control studies for rare diseases need to gather one arm from people who are actually already sick (called “outcome-dependent sampling”).
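A minimal simulation of that selection effect with numpy: A and B are independent, but among samples selected on the common effect C they become correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    c = a + b + rng.normal(scale=0.5, size=n)    # common effect of A and B

    print(np.corrcoef(a, b)[0, 1])               # ~0: marginally independent
    sel = c > 1.0                                # e.g. "only people who answered the survey"
    print(np.corrcoef(a[sel], b[sel])[0, 1])     # clearly negative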
Sterner measures might be needed: could we draw causal nets with not just arrows showing influence but also another kind of arrow showing correlations?
Phil Dawid works with DAG models that are partially causal and partially statistical. But I think we should first be very very clear on exactly what a statistical DAG model is, and what a causal DAG model is, and how they are different. Then we could start combining without confusion!
If you have a prior over DAG/mixed graph structures because you are Bayesian, you can obviously have beliefs about a causal relationship between A and B vs a dependent relationship between A and B, and update your beliefs based on evidence, etc. Bayesian reasoning about causality does involve saying at some point “I have an assumption that is letting me draw causal conclusions from a fact I observed about a joint distribution,” which is not a trivial step (this is not unique to Bayesians of course—anyone who wants to do causality from observational data has to deal with this).
what’s the psychology of this?
Pearl has this hypothesis that a lot of probabilistic fallacies/paradoxes/biases are due to the fact that causal and not probabilistic relationships are what our brain natively thinks about. So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s type reversal under interventions: http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf).
This hypothesis would claim that people who haven’t looked into the math just interpret statements about conditional probabilities as about “interventional probabilities” (or whatever their intuitive analogue of a causal thing is).
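For anyone who has not seen the reversal concretely, the much-reproduced kidney-stone numbers (Charig et al. 1986) run through in a few lines of Python: treatment A wins in each stratum but loses in the aggregate, because it was given mostly to the severe cases.

    strata = {
        "small stones": {"A": (81, 87),   "B": (234, 270)},
        "large stones": {"A": (192, 263), "B": (55, 80)},
    }
    totals = {"A": [0, 0], "B": [0, 0]}
    for name, arms in strata.items():
        for arm, (recovered, n) in arms.items():
            totals[arm][0] += recovered
            totals[arm][1] += n
            print(name, arm, f"{recovered}/{n} = {recovered / n:.0%}")
    for arm, (recovered, n) in totals.items():
        print("overall", arm, f"{recovered}/{n} = {recovered / n:.0%}")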
Good comment—upvoted. Just a minor question:

I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect).
You probably did not intend to imply that this was an arbitrary choice, but it would still be interesting to hear your thoughts on it. It seems to me that the choice to represent independences by missing arrows was necessary. If they had instead chosen to represent dependences by present arrows, I don’t see how the graphs would be useful for causal inference.
If missing arrows represent independences and the backdoor criterion holds, this is interpreted as “for all distributions that are consistent with the model, there is no confounding”. This is clearly very useful. If arrows represented dependences, it would instead be interpreted as “For at least one distribution that is consistent with the DAG model, there is no confounding”. This is not useful to the investigator.
Since unconfoundedness is an independence-relation, it is not clear to me how graphs that encode dependence-relations would be useful. Can you think of a graphical criterion for unconfoundedness in dependence graphs? Or would dependence graphs be useful for a different purpose?
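For concreteness, what the backdoor criterion buys is the identification p(B | do(A)) = Σ_z p(B | A, z) p(z). A toy sketch with invented numbers, comparing that adjusted quantity to the naive conditional under a single observed confounder Z:

    p_z = {0: 0.7, 1: 0.3}                             # P(Z = z)
    p_a_given_z = {0: 0.2, 1: 0.8}                     # P(A = 1 | z)
    p_b_given_az = {(0, 0): 0.1, (0, 1): 0.5,          # P(B = 1 | a, z)
                    (1, 0): 0.3, (1, 1): 0.7}

    # interventional ground truth under do(A = 1): Z keeps its marginal distribution
    p_b_do_a1 = sum(p_b_given_az[(1, z)] * p_z[z] for z in p_z)

    # backdoor adjustment computed from observational quantities (the same expression here,
    # because the model is parameterized directly by p(z), p(a|z), p(b|a,z))
    adjusted = sum(p_b_given_az[(1, z)] * p_z[z] for z in p_z)

    # naive observational P(B = 1 | A = 1), which mixes in the confounding through Z
    p_a1 = sum(p_a_given_z[z] * p_z[z] for z in p_z)
    naive = sum(p_b_given_az[(1, z)] * p_a_given_z[z] * p_z[z] for z in p_z) / p_a1

    print(p_b_do_a1, adjusted, naive)   # 0.42, 0.42, ~0.55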
Hi, thanks for this. I agree that this choice was not arbitrary at all!
There are a few related reasons why it was made.
(a) Pearl wisely noted that it is independences that we exploit for things like propagating beliefs around a sparse graph in polynomial time. When he was still arguing for the use of probability in AI, people in AI were still not fully on board, because they thought that to probabilistically reason about n binary variables we need a 2^n table for the joint, which is a non-starter (of course statisticians were on board w/ probability for hundreds of years even though they didn’t have computers—their solution was to use clever parametric models. In some sense Bayesian networks are just another kind of clever parametric model that finally penetrated the AI culture in the late 80s). (A toy parameter count after point (d) below makes this concrete.)
(b) We can define statistical (causal) models by either independences or dependences, but there is a lack of symmetry here that the symmetry of the “presence or absence of edges in a graph” masks. An independence is about a small part of the parameter space. That is, a model defined by an independence will generally correspond to a lower-dimensional manifold sitting inside the space corresponding to a saturated model (no constraints). A model defined by dependences will just be that same space with a “small part” missing. Lowering dimension in a model is really nice in stats for a number of reasons.
(c) While conceivably we might be interested in a presence of a causal effect more than an absence of a causal effect, you are absolutely right that generally assumptions that allow us to equate a causal effect with some functional of observed data take the form of equality constraints (e.g. “independences in something.”) So it is much more useful to represent that even if we care about the presence of an effect at the end of the day. We can just see how far from null the final effect number is—we don’t need a graphical representation. However a graphical representation for assumptions we are exploiting to get the effect as a functional of observed data is very handy—this is what eventually led Jin Tian to his awesome identification algorithm on graphs.
(d) There is an interesting logical structure to conditional independence, e.g. Phil Dawid’s graphoid axioms. There is something like that for dependences (Armstrong’s axioms for functional dependence in db theory?) but the structure isn’t as rich.
edit: there are actually only two semi-graphoid axioms: one for symmetry and one for the chain rule.
edit^2: graphoids are not complete (because conditional independence is actually kind of a nasty relation). But at least it’s a ternary relation. There are far worse dragons in the cave of “equality constraints.”
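Returning to points (a) and (b): a back-of-the-envelope parameter count (Python, numbers illustrative). A full joint table over n binary variables needs 2^n - 1 free parameters, while a sparse DAG factorization needs one small conditional table per node; and, in the spirit of (b), for two binary variables the saturated model has 3 free parameters while the independence model has 2, a lower-dimensional surface, with "dependence" being everything outside it.

    def full_joint_params(n):
        # free parameters in an unrestricted joint over n binary variables
        return 2 ** n - 1

    def dag_params(parent_counts):
        # one conditional table per binary node: 2**|pa(v)| free parameters each
        return sum(2 ** k for k in parent_counts)

    n = 30
    chain = [0] + [1] * (n - 1)     # a Markov chain: every node but the first has one parent
    print(full_joint_params(n))     # 1073741823
    print(dag_params(chain))        # 59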
Thanks for reading.

I tried to read that, but I think I didn’t understand too much of it or its connection to this topic. I’ll save that whole festschrift for later; there were some interesting titles in the table of contents.
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show.
I agree I did sort of conflate causal networks and Bayesian networks in general… I didn’t realize there was no clean way of having both at the same time.
It might help if I describe a concrete way to test my claim using just causal networks: generate a randomly connected causal network with x nodes and y arrows, where each arrow has some random noise in it; count how many pairs of nodes are in a causal relationship; now, 1000 times, initialize the root nodes to random values and generate a possible state of the network, storing the values for each node; count how many pairwise correlations there are between all the nodes using the 1000 samples (using an appropriate significance test & alpha if one wants); divide # of causal relationships by # of correlations, store; return to the beginning and resume with x+1 nodes and y+1 arrows… As one graphs each x against its respective estimated fraction, does the fraction head toward 0 as x increases? My thesis is it does.
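Here is one possible implementation of that procedure, reading "each arrow has some random noise in it" as a linear-Gaussian structural equation model (that reading is an assumption; numpy and scipy assumed):

    import itertools
    import numpy as np
    from scipy import stats

    def simulate(n_nodes, n_edges, n_samples=1000, alpha=0.05, rng=None):
        rng = rng or np.random.default_rng()
        # random DAG: nodes 0..n-1 are in a fixed topological order; pick n_edges forward edges
        possible = list(itertools.combinations(range(n_nodes), 2))
        idx = rng.choice(len(possible), size=min(n_edges, len(possible)), replace=False)
        edges = [possible[i] for i in idx]
        weights = {e: rng.uniform(0.5, 1.5) * rng.choice([-1, 1]) for e in edges}

        # causal pairs = pairs connected by a directed path (transitive closure)
        reach = np.zeros((n_nodes, n_nodes), dtype=bool)
        for i, j in edges:
            reach[i, j] = True
        for k in range(n_nodes):                 # Floyd-Warshall style closure
            reach |= np.outer(reach[:, k], reach[k, :])
        n_causal = sum(bool(reach[i, j] or reach[j, i]) for i, j in possible)

        # each node = weighted sum of its parents plus independent Gaussian noise
        data = np.zeros((n_samples, n_nodes))
        for j in range(n_nodes):
            data[:, j] = rng.normal(size=n_samples)
            for i in range(j):
                if (i, j) in weights:
                    data[:, j] += weights[(i, j)] * data[:, i]

        # "correlations" = pairs whose Pearson correlation is significant at alpha
        n_corr = sum(stats.pearsonr(data[:, i], data[:, j])[1] < alpha for i, j in possible)
        return n_causal / n_corr if n_corr else float("nan")

    rng = np.random.default_rng(0)
    for x in range(5, 40, 5):
        print(x, round(simulate(x, 2 * x, rng=rng), 3))

The printed fraction is exactly the quantity asked about; how it behaves as x grows, and how sensitive it is to the edge density, the noise model, and the significance test, is the empirical question.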
So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s type reversal under interventions: http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf).
Interesting, and it reminds me of what happens in physics classes: people learn how to memorize teachers’ passwords, but go on thinking in folk-Aristotelian physics fashion, as revealed by simple multiple-choice tests designed to home in on the appealing folk-physics misconceptions vs ‘unnatural’ Newtonian mechanics. That’s a plausible explanation, but I wonder if anyone has established more directly that people really do reason causally even when they know they’re not supposed to? Offhand, it doesn’t really sound like any bias I can think of. It shouldn’t be too hard to develop such a test for teachers of causality material: just take common student misconceptions or dead ends and refine them into a multiple-choice test. I’d bet stats 101 courses have as many problems as intro physics courses.
I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A “structurally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “structurally dependent” with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G. A marginal d-connecting path between two nodes is a path with no consecutive edges of the form * → * ← * (that is, no colliders on the path). In other words, all directed paths are marginal d-connecting but the opposite isn’t true.
That seems to make sense to me.
The justification for this definition is that if A “structurally causes” B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in “most” distributions that arise from causal structures consistent with G. Similarly, if A and B are “structurally dependent” in a DAG G, then in “most” distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say ‘correlations are there’).
I’m not sure about marginal dependence.
I qualify with “most” because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions “unfaithful” to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (the subset of distributions consistent with G that are unfaithful to G has measure zero), but Nature does not pick randomly, so unfaithful distributions are a worry.
I’m afraid I don’t understand you here. If we draw an arrow from A to B, either as a causal or Bayesian net, because we’ve observed correlation or causation (maybe we actually randomized A for once), how can there not be a relationship in any underlying reality and there actually be an ‘independence’ and the graph be ‘unfaithful’?
Anyway, it seems that either way, there might be something to this idea. I’ll keep it in mind for the future.
This post is a good example of why LW is dying. Specifically, that it was posted as a comment to a garbage-collector thread in the second-class area. Something is horribly wrong with the selection mechanism for what gets on the front page.
Underconfidence is a sin. This is specifically about gwern’s calibration. (EDIT: or his preferences)
Not everyone is in the same situation. I mean, we recently had an article disproving the theory of relativity posted in Main (later moved to Discussion). Texts with less value than gwern’s comment do regularly get posted as articles. So it’s not like everyone is afraid to post anything. Some people should update towards posting articles, some people should update towards trying their ideas in Open Thread first. Maybe they need a little nudge from outside first.
How about adding this text to the Open Thread introduction?
As a rule of thumb, if your top-level comment here received 15 or more karma, you probably should repost it as a separate article—either in the same form, or updated. (And probably post texts of similar quality directly as articles in the future.)
And a more courageous idea: Make a script which collects all top-level Open Thread comments with 15 or more karma from all Open Threads in history and sends their authors a message (one message per author with links to all such comments, to prevent inbox flood) that they should consider posting this as an article.
As a rule of thumb, if your top-level comment here received 15 or more karma, you probably should repost it as a separate article—either in the same form, or updated. (And probably post texts of similar quality directly as articles in the future.)
More generally, I’d say that if it’s longer than about four paragraphs, it’s probably better suited as its own article than as a comment.
Let’s talk about the fact that the top two comments on a very nice contribution in the open thread are about how this is the wrong place for the post, or how it is why LW is dying. Actually let’s not talk about that.
It would be interesting to try to come up with good priors for random causal networks.

A simple informal reason for giving a small probability to “A causes B” could be this:
For any fixed values A and B, there is only one explanation “A causes B”, one explanation “B causes A”, and many explanations “C causes A and B” (for many different values of C). If we split 100% between all these explanations, the last group gets most of the probability mass.
And as you said, the more complex a given field is, the more realistic values of C there are.
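That counting argument in miniature, under the purely illustrative assumption that every candidate explanation gets equal prior weight:

    def p_a_causes_b(n_candidate_confounders):
        # one "A -> B", one "B -> A", and one "C -> A, C -> B" per candidate C
        return 1 / (2 + n_candidate_confounders)

    for k in (1, 10, 100, 1000):
        print(k, p_a_causes_b(k))   # 0.33..., 0.083..., 0.0098..., 0.000998...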
You are leaving out one type of causation. It is possible that you are conditioning on some common effect C of A and B.
You may argue that this does not actually give a correlation between A and B, it only gives a correlation between A given C and B given C. However, in real life, there will always be things you condition on whenever you collect data, so you cannot completely remove this possibility.
This would be the same sort of selection-causing-pseudo-correlations that Yvain discusses in http://slatestarcodex.com/2014/03/01/searching-for-one-sided-tradeoffs/ ? Hm… I think I would lump that in with my response to Nancy (‘yes, we rarely have huge n’s and can wish away sampling error and yes our data collection is usually biased or conditioned in ways we don’t know but let’s ignore that to look at the underlying stuff’).
Even if we promoted it to the level of the other 3 causation patterns, does that change any of my arguments? It seems like another way of producing correlations-which-aren’t-due-to-direct-causation just emphasizes the point.
Yes, I am talking about the exact same thing that Yvain is talking about there.
So, I think any time you observe a correlation, it is because of one of those 4 causation patterns, so even if the fourth does not show up as regularly as the other 3, you should include it for completeness.
Regarding the psychology of why people overestimate the correlation-causation link, I was just recently reading this, and something vaguely relevant struck my eye:
Later, Johnson-Laird put forward the theory that individuals reason by carrying out three fundamental steps [21]:
They imagine a state of affairs in which the premises are true – i.e. they construct a mental model of them.
They formulate, if possible, an informative conclusion true in the model.
They check for an alternative model of the premises in which the putative conclusion is false.
If there is no such model, then the conclusion is a valid inference from the premises.
Johnson-Laird and Steedman implemented the theory in a computer program that made deductions from singly-quantified assertions, and its predictions about the relative difficulty of such problems were strikingly confirmed: the greater the number of models that have to be constructed in order to draw the correct conclusion, the harder the task [25]. Johnson-Laird concluded [22] that comprehension is a process of constructing a mental model, and set out his theory in an influential book [23]. Since then he has applied the idea to reasoning about Boolean circuitry [3] and to reasoning in modal logic [24].
It’s hard to say because how would you measure this other than directly, and to measure this directly you need a clear set of correlations which are proposed to be causal, randomized experiments to establish what the true causal relationship is, and both categories need to be sharply delineated in advance to avoid issues of cherrypicking and retroactively confirming a correlation so you can say something like ’11 out of the 100 proposed A->B causal relationships panned out’. This is pretty rare, although the few examples I’ve found from medicine tend to indicate under 10%. Not great. And we can’t explain all of this away as the result of illusory correlations being thrown up by the standard statistical problems with findings such as small n/sampling error, selection bias, publication bias, etc.
Say you do this, and you find that about 10% of all correlations in a dataset are shown to have a causal link. Can you then look for a correlation between certain aspects of a correlation (such as coefficient, field of study) and those correlations which are causal?
Building on this, you might establish something like “correlations at .9 are more likely to be causal than correlations at .7” and establish a causal mechanism for this. Alternatively, you might find that “correlations from the field of farkology are more often causal than correlations from spleen medicine”, and find a causal explanation for this.
Part or all of this explanation might involve the size of the causal network. It could well be that both correlation coefficients and field of study are just proxy variables to describe the size of a network, and that’s the only important factor in the ratio of correlations to causal links, but it might be the case that there is more to it.
This could lead to quite a bit of trouble in academic literature, as measures of what evidence a correlation is for causation will become dependent on a set of variables about the context you’re working in, and this could potentially be gamed. In fact, that could be the case even with gwern’s original proposition—claiming you’re working with a small causal net could be enough to lend strong evidence to a causal claim based on correlation, and it’s only by having someone point out that your causal net is lacking that this evidence can have its weighting adjusted.
All these thoughts are sketchy outlines of an extension of what gwern’s brought up. More considered comment is welcome.
Building on this, you might establish something like “correlations at .9 are more likely to be causal than correlations at .7” and establish a causal mechanism for this. Alternatively, you might find that “correlations from the field of farkology are more often causal than correlations from spleen medicine”, and find a causal explanation for this.
I would be very surprised if this was not the case. Different fields already use different cutoffs for statistical-significance (you might get away with p<0.05 in psychology, but particle physics likes its five-sigmas, and in genomics the cutoff will be hundreds or thousands of times smaller and vary heavily based on what exactly you’re analyzing) and likewise have different expectations for effect sizes (psychology expects large effects, medicine expects medium effects, and genomics expects very small effects; e.g. for genetic influence on IQ, any claim of an allele with an effect larger than d=0.06 should be greeted with surprise and alarm).
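For comparison on one scale, the cutoffs mentioned above expressed as two-sided normal tail areas (scipy assumed; 5e-8 is the conventional genome-wide significance threshold):

    from scipy import stats

    for sigmas in (1.96, 5):
        print(sigmas, 2 * stats.norm.sf(sigmas))   # 1.96 -> ~0.05, 5 -> ~5.7e-7
    print(stats.norm.isf(5e-8 / 2))                # genome-wide 5e-8 is roughly 5.45 sigma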
Part or all of this explanation might involve the size of the causal network. It could well be that both correlation coefficients and field of study are just proxy variables to describe the size of a network, and that’s the only important factor in the ratio of correlations to causal links, but it might be the case that there is more to it.
I think that there is going to be a relationship, but it’ll be hard to describe precisely. Suppose we correlated A and B and found r=0.9. This is a large correlation by most fields’ standards, and it would seem to put constraints on the causal net that A and B are part of: either there aren’t many nodes ‘in between’ A and B (because each node is a chance for the correlation to diminish and be lost in influence from all the neighboring nodes, with their own connections) or the nodes are powerfully correlated so the net correlation can still be as high as 0.9.
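A toy version of that constraint: in a standardized linear-Gaussian chain A → M1 → … → Mk → B, the marginal correlation is the product of the link correlations, so even strong 0.9 links decay quickly as intermediate nodes are added.

    link_r = 0.9
    for k in range(6):                       # k = number of intermediate nodes
        print(k, round(link_r ** (k + 1), 3))
    # 0 0.9, 1 0.81, 2 0.729, 3 0.656, 4 0.59, 5 0.531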
This could lead to quite a bit of trouble in academic literature, as measures of what evidence a correlation is for causation will become dependent on a set of variables about the context you’re working in, and this could potentially be gamed. In fact, that could be the case even with gwern’s original proposition—claiming you’re working with a small causal net could be enough to lend strong evidence to a causal claim based on correlation, and it’s only by having someone point out that your causal net is lacking that this evidence can have its weighting adjusted.
To a large extent, this is already the case (see above). People justify results with relation to implicit models and supposed analysis procedures (‘we reported a t-test so we are entitled to declare p<0.05 statistically-significant (never mind all the tweaks we tried and interim tests while collecting data)’). The existing defaults aren’t usually well-justified: for example, why does psychology use 0.05 rather than 0.10 or 0.01? ‘Surely God loves p=0.06 almost as much as he loves the p=0.05,’ as one line goes.
I would be very surprised if this was not the case. Different fields already use different cutoffs for statistical-significance (you might get away with p<0.05 in psychology, but particle physics likes its five-sigmas, and in genomics the cutoff will be hundreds or thousands of times smaller and vary heavily based on what exactly you’re analyzing) and likewise have different expectations for effect sizes (psychology expects large effects, medicine expects medium effects, and genomics expects very small effects; e.g. for genetic influence on IQ, any claim of an allele with an effect larger than d=0.06 should be greeted with surprise and alarm).
The existing defaults aren’t usually well-justified: for example, why does psychology use 0.05 rather than 0.10 or 0.01?
This is a good point, and leads to what might be an interesting use of the experimental approach of linking correlations to causation: gauging whether the heuristics currently in use in a field are at a suitable level/reflect the degree to which correlation is evidence for causation.
If you were to find, for example, that physics is churning out huge sigmas where it doesn’t really need to, or psychology really really needs to up its standards of evidence (not that that in itself would be a surprising result), those could be very interesting results.
Of course, to run these experiments you need large samples of well-researched correlations you can easily and objectively test for causality, from all the fields you’re looking at, which is no small requirement.
If the falling price of gene sequencing lets us determine a lot about how genes influence human behavior, social scientists, I predict, will get a lot better at figuring out the causal effects of social programs.
Once social scientists get past their taboo against genetic explanations.

Better genetic analysis will make it easier to discuss politically incorrect topics because rather than talking about IQ you could discuss complex gene clusters characterized by hard to understand mathematical correlations. And I strongly suspect that with a better understanding of genetics race would become much less significant in statistical analysis because after you account for genetics you would gain little statistical significance by directly adding race into a regression (i.e. if gene X does something important and 80% of Asians but only 5% of whites have the gene then without genetic analysis race is important but after you know who has the gene race isn’t statistically significant.)
There are at least two more possibilities: A and B are unrelated, but happen to be in sync for a while, and the data was collected wrong in some way.

I’m choosing to ignore that possibility to clarify the exposition of what I think is going on. Problems like that are what I’m referring to when I preface:
And we can’t explain all of this away as the result of illusory correlations being thrown up by the standard statistical problems with findings such as small n/sampling error, selection bias, publication bias, etc.
Even if we had enormous clean datasets showing correlations to whatever level of statistical-significance you please, you still can’t spin the straw of correlation into the gold of causation, and the question remains why.
You could say that “A and B happen to be in sync for a while” is possibility 3, where C is the passage of time. (Unless by “happen to be in sync for a while” you mean that they appear to be correlated because of a fluke.)
To generalize, it’s also possible that you’re observing survivor effects, i.e., both A and not B (or B and not A) cause the data to appear in your data set.
When it comes to the replication of those breakthrough cancer results, I think you can’t forget publication bias. A lab runs an experiment 6 times. If it produces results in one of those trials, they write a paper.
EDIT: I’ve removed this draft & posted a longer version incorporating some of the feedback here at http://lesswrong.com/lw/khd/confound_it_correlation_is_usually_not_causation/
I would prefer posts like that to stand on their own in discussion and not be posted in an open thread.