Indeed, it does often happen that an incorrect model is assigned higher prior probability, because that incorrect model is simpler. The usual expectation, in such cases, is that the true model will quickly win out once one starts updating on data. In your example, when updating on data, one would presumably find that e.g. “tired” and “swimming” are not independent, and their empirical correlation (in the data) can therefore be accounted for by the “more complex” (lower prior) model, but not by the “simpler” (higher prior) model.
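To make that concrete, here's a minimal sketch of the model comparison, with made-up counts and flat priors (the numbers and priors are mine, purely illustrative). The "simpler" model treats tired and swimming as independent Bernoullis; the "more complex" model fits the full joint table, and its marginal likelihood pulls ahead as correlated data accumulates:

```python
import numpy as np
from scipy.special import betaln, gammaln

# Made-up counts over n days: rows = tired (no/yes), cols = swimming (no/yes).
counts = np.array([[60, 10],
                   [15, 40]])
n = counts.sum()

# M0: tired and swimming independent, each Bernoulli with a Beta(1,1) prior.
def log_ml_bernoulli(k, n):
    return betaln(1 + k, 1 + n - k) - betaln(1, 1)

log_m0 = (log_ml_bernoulli(counts[1].sum(), n)        # marginal for tired
          + log_ml_bernoulli(counts[:, 1].sum(), n))  # marginal for swimming

# M1: full joint over the four cells, with a flat Dirichlet(1,1,1,1) prior.
alpha = np.ones(4)
k = counts.flatten()
log_m1 = (gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
          + np.sum(gammaln(alpha + k) - gammaln(alpha)))

print(log_m1 - log_m0)  # log Bayes factor; positive means the data favor M1
```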
Tired and swimming are not independent, but that’s a correlational error. You can indeed get a more accurate picture of the correlations, given more evidence, but you cannot conclude causational structure from correlations alone.
How about this: would any amount of observation ever cause one to conclude that camping causes swimming rather than the reverse? The answer is clearly no: they are correlated, but there’s no way to use the correlation between them (or their relationships to any other variables) to distinguish between swimming causing camping and camping causing swimming.
You can indeed get a more accurate picture of the correlations, given more evidence, but you cannot conclude causational structure from correlations alone.
How about this: would any amount of observation ever cause one to conclude that camping causes swimming rather than the reverse?
You can totally conclude causational structure from correlations alone; it just requires observing more variables. Judea Pearl is the canonical source on the topic (Causality for the full thing, Book of Why or Causal Inference In Statistics for intros written for broader audiences); Yudkowsky also has a good intro here which cuts right to the chase.
Thanks for linking to Yudkowsky’s post (though it’s a far cry from cutting to the chase… I skipped a lot of superfluous text in my skim). It did change my mind a bit, and I see where you’re coming from. I still disagree that it’s of much practical relevance: in many cases, no matter how many more variables you observe, you’ll never conclude the true causational structure. That’s because it strongly matters which additional variables you observe.
Let me rephrase Yudkowsky’s point (and I assume also your point) like this. We want to know if swimming causes camping, or if camping causes swimming. Right now we know only that they correlate. But if we find another variable that correlates with swimming and is independent of camping, that would be evidence towards “camping causes swimming”. For example, if swimming happens on Tuesdays but camping is independent of Tuesdays, it’s suggestive that camping causes swimming (because if swimming caused camping, you’d expect the Tuesday/swimming correlation to induce a Tuesday/camping correlation).
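Here's a minimal simulation of that asymmetry, with invented numbers (the 0.6/0.2 etc. probabilities are just placeholders). In the world where swimming causes camping, an exogenous Tuesday effect on swimming leaks through to camping; in the world where camping causes swimming, Tuesday and camping stay independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tuesday = rng.random(n) < 1 / 7  # exogenous "it's Tuesday" indicator

# World A (swimming -> camping): Tuesday raises swimming; camping follows swimming.
swim_a = rng.random(n) < np.where(tuesday, 0.6, 0.2)
camp_a = rng.random(n) < np.where(swim_a, 0.7, 0.1)

# World B (camping -> swimming): camping is upstream; Tuesday raises swimming directly.
camp_b = rng.random(n) < 0.3
swim_b = rng.random(n) < 0.15 + 0.4 * camp_b + 0.3 * tuesday

print(np.corrcoef(tuesday, camp_a)[0, 1])  # clearly nonzero: the correlation leaks through
print(np.corrcoef(tuesday, camp_b)[0, 1])  # ~0: Tuesday/camping remain independent
```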
First, I admit that this is a neat observation that I hadn’t fully appreciated or known how to articulate before reading the article. So thanks for that. It’s food for thought.
Having said that, there are still a lot of problems with this story:
First, unnatural variables are bad: I can always take something like “an indicator variable for camping, except if swimming is present, negate this indicator with probability p”. This variable, call it X, can be made uncorrelated with swimming by picking p correctly, yet it will be correlated with camping; hence, by adding it, I can cause the model to say swimming causes camping. (I think I can even make the variable independent of swimming instead of just uncorrelated, but I didn’t check.) So to trust this model, I’d need some assumption that the variables are somehow “natural”: not cherry-picked, not handed to me by some untrusted source with a stake in the matter. (A numerical sketch of this construction follows after the third problem below.)
In practice, it can be hard to find any good variables that correlate with one thing but not the other. For example, suppose you’re trying to establish “lead exposure in gestation causes low IQ”. Good luck trying to find something natural that correlates with low neonatal IQ but not with lead; everything will be downstream of SES. And you don’t get to add SES to your model, because you never observe it directly!
More generally, real life has these correlational clusters, these “positive manifolds” of everything-correlating-with-everything. Like, consumption of all “healthy” foods correlates together, and also correlates with exercise, and also with not being overweight, and also with longevity, etc. In such a world, adding more variables will just never disentangle the causational structure at all, because you never find yourself adding a variable that’s correlated with one thing but not another.
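Here's the sketch promised in problem 1, with invented probabilities throughout. Start from correlated camping/swimming indicators, then flip a copy of the camping indicator with a tuned probability p whenever swimming occurs; the copy decorrelates from swimming while staying correlated with camping:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Correlated binaries (hypothetical joint distribution): swimming S, camping C.
S = rng.random(n) < 0.5
C = np.where(S, rng.random(n) < 0.8, rng.random(n) < 0.3)

# Adversarial variable X: copy of C, flipped with probability p when S is present.
# Choose p so that E[X|S=1] == E[X|S=0], which kills the X-S correlation:
#   p = (P(C=1|S=0) - P(C=1|S=1)) / (1 - 2 * P(C=1|S=1))
p = (0.3 - 0.8) / (1 - 2 * 0.8)  # = 5/6
flip = S & (rng.random(n) < p)
X = np.where(flip, ~C, C)

print(np.corrcoef(X, S)[0, 1])  # ~0: X looks unrelated to swimming
print(np.corrcoef(X, C)[0, 1])  # clearly positive: X correlates with camping
```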
Problem 1 is mostly handled automagically by doing Bayesian inference to choose our model. The key thing to notice is that the “unnatural” variable’s construction requires that we know what value to give p, which is something we’d typically have to learn from the data itself. Which means, before seeing the data, that particular value of p would typically have low prior. Furthermore, as more data lets us estimate things more precisely, we’d have to make p more precise in order to keep perfectly negating things, and p-to-within-precision has lower and lower prior as the precision gets smaller. (Though there will be cases where we expect a priori that parameters will take on values which make the causal structure ambiguous—agentic systems under selection pressure are a central example, and that’s one of the things which make agentic systems particularly interesting to study.)
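As a back-of-the-envelope way to quantify that penalty (my sketch, not anything from the thread): under any smooth prior density $\pi$ on $p$, the prior mass of the "masking window" around the masking value $p^*$ shrinks linearly with the tolerance, while $n$ samples shrink the tolerance like $1/\sqrt{n}$:

$$\Pr\big(|p - p^*| < \epsilon\big) \;\approx\; 2\epsilon\,\pi(p^*), \qquad \epsilon = O(1/\sqrt{n}) \;\Rightarrow\; \text{prior mass} = O(1/\sqrt{n}),$$

so the perfectly-masked structure loses the Bayes-factor race against the genuinely-independent structure at a rate of roughly $\sqrt{n}$ (the same one-extra-parameter penalty BIC assigns).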
Problems 2-3 are mostly handled by building models of a whole bunch of variables at once. We’re not just looking for e.g. a single variable which correlates with low neonatal IQ but not with lead. Slightly more generally, for instance, a combination of variables which correlates with low neonatal IQ but not with lead, conditional on some other variables, would suffice (assuming we correctly account for multiple hypothesis testing). And the “conditional on some other variables” part could, in principle, account for SES, insofar as we use enough variables to basically determine SES to precision sufficient for our purposes.
More generally than that: once we’ve accepted that correlational data can basically power Bayesian inference of causality, we can, in principle, just do Bayesian inference on everything at once, and account for all the evidence about causality which the data provides—some of which will be variables that correlate with one thing but not the other, but some of which will be other stuff than that (like e.g. Markov blankets informing graph structure, or two independent variables becoming dependent when conditioning on a third). Variables which correlate with one thing but not the other are just one particularly-intuitive small-system example of the kind of evidence we can get about causal structure.
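That last parenthetical example (a collider) is easy to see in simulation; here's a minimal sketch with Gaussian noise, where a crude threshold on the third variable stands in for conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
a = rng.normal(size=n)
b = rng.normal(size=n)                 # a and b are independent by construction
c = a + b + 0.1 * rng.normal(size=n)   # collider: c is caused by both

print(np.corrcoef(a, b)[0, 1])         # ~0 marginally
mask = c > 1.0                         # condition on (a slice of) the collider
print(np.corrcoef(a[mask], b[mask])[0, 1])  # clearly negative: dependence appears
```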
(Also I strong-upvoted your comment for rapid updating, kudos.)
I don’t think problem 1 is so easy to handle. It’s true that I’ll have a hard time finding a variable that’s perfectly independent of swimming but correlated with camping. However, I don’t need to be perfect to trick your model.
Suppose every 4th of July, you go camping at one particular spot that does not have a lake. Then we observe that July 4th correlates with camping but does not correlate with swimming (or even negatively correlates with swimming). The model updates towards swimming causing camping. Getting more data on these variables only reinforces the swimming->camping direction.
To update in the other direction, you need to find a variable that correlates with swimming but not with camping. But what if you never find one? What if there’s no simple thing that causes swimming? Say I go swimming based on the roll of a die, but you never get to see the die. Then you’re toast!
Slightly more generally, for instance, a combination of variables which correlates with low neonatal IQ but not with lead, conditional on some other variables, would suffice (assuming we correctly account for multiple hypothesis testing). And the “conditional on some other variables” part could, in principle, account for SES, insofar as we use enough variables to basically determine SES to precision sufficient for our purposes.
Oh, sure, I get that, but I don’t think you’ll manage to do this, in practice. Like, go ahead and prove me wrong, I guess? Is there a paper that does this for anything I care about? (E.g. exercise and overweight, or lead and IQ, or anything else of note.) Ideally I’d get to download the data and check if the results are robust to deleting a variable or to duplicating a variable (when duplicating, I’ll add noise so that the variables aren’t exactly identical).
If you prefer, I can try to come up with artificial data for the lead/IQ thing in which I generate all variables to be downstream of non-observed SES, but in which IQ is also slightly downstream of lead (and other things are slightly downstream of other things in a randomly chosen graph). I’ll then let you run your favorite algorithm on it. What’s your favorite algorithm, by the way? What’s been mentioned so far sounds like it should take exponential time (e.g. enumerating over all orderings of the variables, drawing the Bayes net given each ordering, and then picking the one with fewest parameters—that takes exponential time).
(This is getting into the weeds enough that I can’t address the points very quickly anymore; they’d require longer responses. But I’m leaving a minor note about this part:
Suppose every 4th of July, you go camping at one particular spot that does not have a lake. Then we observe that July 4th correlates with camping but does not correlate with swimming (or even negatively correlates with swimming).
For purposes of causality, negative correlation is the same as positive. The only distinction we care about, there, is zero or nonzero correlation.)
For purposes of causality, negative correlation is the same as positive. The only distinction we care about, there, is zero or nonzero correlation.)
That makes sense. I was wrong to emphasize the “even negatively”, and should instead stick to something like “slightly negatively”. You have to care about large vs. small correlations or else you’ll never get started doing any inference (no empirical correlation is ever exactly 0).
You can totally conclude causational structure from correlations alone; it just requires observing more variables. Judea Pearl is the canonical source on the topic
I am surprised by this claim, because Pearl stresses that you can get no causal conclusions without causal assumptions.
How, specifically, would you go about discovering the correct causal structure of a phenomenon from correlations alone?
Eh, Pearl’s being a little bit coy. We can typically get away with some very weak/general causal assumptions—e.g. “parameters just happening to take very precise values which perfectly mask the real causal structure is improbable a priori” is roughly the assumption Pearl mostly relies on (under the guise of “minimality and stability” assumptions). Causality chapter 2 walks through a way to discover causal structure from correlations, leveraging those assumptions, though the algorithms there aren’t great in practice—“test gazillions of conditional independence relationships” is not something one can do in practice without a moderate rate of errors along the way, and Pearl’s algorithm assumes those tests as a building block. Still, it makes the point that this is possible in principle, and once we accept that, we can just go full Bayesian model comparison.
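For concreteness, here's a toy version of that chapter-2 style of inference for three variables, using partial correlations as stand-in independence tests. The threshold and the linear/Gaussian assumptions are mine, not Pearl's; real constraint-based algorithms (IC/PC) generalize this pattern:

```python
import numpy as np

def corr(u, v):
    return abs(np.corrcoef(u, v)[0, 1])

def partial_corr(u, v, w):
    # correlation of u and v after linearly regressing out w (Gaussian case)
    ru = u - np.polyval(np.polyfit(w, u, 1), w)
    rv = v - np.polyval(np.polyfit(w, v, 1), w)
    return abs(np.corrcoef(ru, rv)[0, 1])

rng = np.random.default_rng(0)
n = 100_000
# Ground truth: a collider x -> z <- y.
x, y = rng.normal(size=n), rng.normal(size=n)
z = x + y + rng.normal(size=n)

eps = 0.01  # ad hoc independence threshold
if corr(x, y) < eps and partial_corr(x, y, z) > eps:
    print("collider: x -> z <- y")    # independent marginally, dependent given z
elif corr(x, y) > eps and partial_corr(x, y, z) < eps:
    print("chain or fork through z")  # dependent marginally, independent given z
```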
I think of these assumptions in a similar way to e.g. independence assumptions across “experiments” in standard statistics (though I’d consider the assumptions needed for causality much weaker than those). Like, sure, we need to make some assumptions in order to do any sort of mathematical modelling, and that somewhat limits how/where we apply the theory, but it’s not that much of a barrier in practice.