As a regular LW reader who has never been that into causality, this reads as a blisteringly hot take to me. My first thought is, what about acausal correlations? You could have two instances of the same program running on opposite sides of the universe, and their outputs would be the same, but there is clearly no causal influence there. The next example that comes to mind is two planets orbiting their respective stars which just so happen to have the same orbital period; their angular offset over time will correlate, and again there is no common cause.
(In both cases you could say that the common cause is something like the laws of physics allowing two copies of similar systems to come into existence, but I would say that stretches the concept of causality beyond usefulness.)
I also notice that there’s no Wikipedia page for “Reichenbach’s Common Cause Principle”, which makes me think it’s not a particularly widely accepted idea. (In any case I don’t think this has an effect on the value of the rest of this sequence.)
In the example of the two programs, we have to be careful about what we mean by statistical correlation vs. the more standard / colloquial use of the term. I’m assuming that when you say ‘the same program running on opposite ends of the universe, and their outputs would be the same’ you are referring to a deterministic program (else there would be no guarantee that the outputs were the same). But if the output of the two programs is deterministic, then there can be no statistical correlation between them. Let A be the outcome of the first program and B the outcome of the second. To measure statistical correlation we have to run the two programs many times, generating i.i.d. samples of A and B, and they are correlated if P(A, B) is not equal to P(A)P(B). But if the two programs are deterministic, say A = a and B = b with probability 1, then they are not statistically correlated, as P(A = a, B = b) = 1 and P(A = a)P(B = b) = 1. So to get some correlation the outputs of the programs have to be random. To have two random algorithms generating correlated outcomes, they need to share some randomness they can condition their outputs on, i.e. a common cause. With the two planets example, we run into the same problem again. (PS: by correlation Reichenbach means statistical dependence here rather than e.g. Pearson correlation, but the same argument applies.)
Broadly speaking, the point of the (perhaps confusing) reference in the article is to say that if we accept the laws of physics, then all the things we observe in the universe are ultimately generated by causal dynamics (e.g. the classical equations of motion being applied to some initial conditions). We can always describe these causal dynamics + initial conditions using a causal model. So there is always ‘some causal model’ that describes our data.
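To make the deterministic case concrete, here is a minimal Python sketch of the independence check described above. The function name `dependence_gap` and the sample data are my own illustrative assumptions, not anything from the article; it just compares the empirical joint frequency with the product of the empirical marginals.

```python
from collections import Counter

def dependence_gap(samples):
    """Largest |P(a, b) - P(a)P(b)| over observed values, using empirical frequencies."""
    n = len(samples)
    joint = Counter(samples)
    p_a = Counter(a for a, _ in samples)
    p_b = Counter(b for _, b in samples)
    return max(
        abs(joint[(a, b)] / n - (p_a[a] / n) * (p_b[b] / n))
        for a in p_a
        for b in p_b
    )

# Hypothetical data: two deterministic programs give the same pair on every run.
deterministic_runs = [("a", "b")] * 1000
gap = dependence_gap(deterministic_runs)
print(gap)  # 0.0: the joint equals the product of the marginals

# For contrast, two outputs that copy a shared fair bit are clearly dependent.
shared_bit_runs = [(0, 0), (1, 1)] * 500
print(dependence_gap(shared_bit_runs))  # 0.25
```

The first case has a gap of zero because both marginals and the joint put all their mass on a single value; the second shows what a genuine statistical dependence looks like under the same measure.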
Thanks for writing that out! I’ve enjoyed thinking this through some more.
I agree that, if you instantiated many copies of the program across the universe as your sampling method, or somehow otherwise “ran them many times”, then their outputs would be independent in the sense that P(A, B) = P(A)P(B). This also holds true if, on each run, there was some “local” error to the program’s otherwise deterministic output.
I had intended to be using the program’s output as a time series of bits, where we are considering the bits to be “sampling” from A and B. Let’s say it’s a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they’re not probabilistically independent, and therefore there is a correlation not due to a causal influence.
But this is in a Bayesian framing, where the probability isn’t a physical thing about the programs, it’s a thing inside my mind. So, while there is a common source of the correlation (my uncertainty over what the digits of pi are) it’s certainly not a “causal influence” on A and B.
This matters to me because, in the context of agent foundations and AI alignment, I want my probabilities to be representing my state of belief (or the agent’s state of belief).
I had intended to be using the program’s output as a time series of bits, where we are considering the bits to be “sampling” from A and B. Let’s say it’s a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they’re not probabilistically independent, and therefore there is a correlation not due to a causal influence.
Just to chip in on this: in the case you’re describing, the numbers are not statistically correlated, because they are not random in the statistics sense. They are only random given logical uncertainty.
When considering logical “random” variables, there might well be a common logical “cause” behind any correlation. But I don’t think we know how to properly formalise or talk about that yet. Perhaps one day we can articulate a logical version of Reichenbach’s principle :)
Yeah, I think I agree that the resolution here is something about how we should use these words. In practice I don’t find myself having to distinguish between “statistics” and “probability” and “uncertainty” all that often. But in this case I’d be happy to agree that “all statistical correlations are due to causal influences” given that we mean “statistical” in a more limited way than I usually think of it.
But I don’t think we know how to properly formalise or talk about that yet.
A group of LessWrong contributors has made a lot of progress on these ideas of logical uncertainty and (what I think they’re now calling) functional decision theory over the last 15ish years, although I don’t really follow it myself, so I’m not sure how close they’d say we are to having it properly formalized.
Thanks for commenting! This is an interesting question and answering it requires digging into some of the subtleties of causality. Unfortunately the time series framing you propose doesn’t work because this time series data is not iid (the variable A = “the next number out of program 1” is not iid), while by definition the distributions P(A), P(B) and P(A,B) you are reasoning with assume iid samples. We really have to have iid here, otherwise we are trying to infer correlation from a single sample. By treating non-iid variables as iid we can see correlations where there are no correlations, but those correlations come from the fact that the next output depends on the previous output, not because the output of one program depends on the output of the other program.
We can fix this by imagining a slightly different setup that I think is faithful to your proposal. It is basically the same thing, but instead of computing pi, both programs have in memory a random string of bits, with 0 or 1 occurring with probability 1/2 for each bit. Both programs just read out the string. Let the string of random bits be identical for programs 1 and 2. Now we can describe each output of the programs as iid. If the strings are the same for both programs, the outputs of the programs are perfectly correlated. And you are right, by looking at the output of one of the programs I can update my beliefs on the output of the other program.
Then we need to ask, how do we generate this experiment? To get the string of random bits we have to sample a coin flip, and then make two copies of the outcome and send it to both programs. If we tried to do this with two coins separately at different ends of the universe, we would get different bit strings. So the two programs have in their past light cones a shared source of randomness: this is the common cause.
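A quick way to see the difference between the two ways of generating this experiment is to simulate them. This is only a sketch under my own assumptions: the function name `agreement_rate`, the fixed seed and the number of bits are all illustrative choices.

```python
import random

def agreement_rate(n_bits, shared, seed=0):
    """Fraction of positions where the two programs output the same bit."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_bits):
        if shared:
            # One coin flip, copied to both programs (the common cause).
            bit = rng.getrandbits(1)
            a, b = bit, bit
        else:
            # Two separate coin flips, one per program.
            a, b = rng.getrandbits(1), rng.getrandbits(1)
        matches += (a == b)
    return matches / n_bits

shared_rate = agreement_rate(10_000, shared=True)
independent_rate = agreement_rate(10_000, shared=False)
print(shared_rate)       # 1.0: the outputs always agree
print(independent_rate)  # close to 0.5: agreement only by chance
```

With the shared flip the two outputs agree on every bit, while with separate flips they agree only about half the time, which is what independence predicts.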
I’d agree that the bits of output are not independent in some physical sense. But they’re definitely independent in my mind! If I hear that the 100th binary digit of pi is 1, then my subjective probability over the 101st digit does not update at all, and remains at 0.5/0.5. So this still feels like a frequentism/Bayesianism thing to me.
Re: the modified experiment about random strings, you say that “To get the string of random bits we have to sample a coin flip, and then make two copies of the outcome”. But there’s nothing preventing the universe from simply containing two copies of the same random string, created causally independently. But that’s also vanishingly unlikely as the string gets longer.
Yes I can flip two independent coins a finite number of times and get strings that appear to be correlated. But in the asymptotic limit the probability they are the same (or correlated at all) goes to zero. Hence, two causally unrelated things can appear dependent for finite sample sizes. But when we have infinite samples (which is the limit we assume when making statements about probabilities) we get P(a,b) = P(a)P(b).
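For the simplest version of this, exact agreement between two independently flipped fair-coin strings, the probability can be written down directly: each bit matches with probability 1/2, so n bits all match with probability (1/2)^n. The sketch below just evaluates that; the function name `match_probability` is my own, and it only covers exact equality, not weaker forms of apparent correlation.

```python
from fractions import Fraction

def match_probability(n_bits):
    """Probability that two independent fair-coin strings of length n_bits are identical."""
    return Fraction(1, 2) ** n_bits

for n in (1, 10, 100):
    print(n, float(match_probability(n)))
```

Already at 100 bits the chance of an accidental match is below 10^-30, which is the sense in which a coincidental copy becomes vanishingly unlikely as the string grows.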
It may be useful to know that if events all obey the Markov property (they are probability distributions, conditional on some set of causal parents), then the Reichenbach Common Cause Principle follows (by d-separation arguments) as a theorem. So any counterexamples to RCCP must violate the Markov property as well.
There’s also a lot of interesting discussion here.
Ultimately, all statistical correlations are due to causal influences.
As a regular LW reader who has never been that into causality, this reads as a blisteringly hot take to me.
You are right this is somewhat blistering, especially for this LW forum.
It would have been less controversial for the authors to say that ‘all statistical correlations can be modelled as causal influences’. Correlations between two observables can always be modelled as being caused by the causal dependence of both on the value of a certain third variable, which may (if the person making the model wants to) be defined as a hidden variable that cannot by definition be observed.
After it has been drawn up, such a causal model claiming that an observed statistical correlation is being caused by a causal dependency on a hidden variable might then be either confirmed or falsified, for certain values of confirmed or falsified that philosophers love to endlessly argue about, by 1) further observations or by 2) active experiment, an experiment where one does a causal intervention.
Pearl kind of leans towards 2), the active experiment route, for confirming or falsifying the model. Deep down, one of the points Pearl makes is that experiments can be used to distinguish between correlation and causation, that this experimentalist route has been ignored too much by statisticians and Bayesian philosophers alike, and that this route has also been improperly maligned by the cigarette industry and other merchants of doubt.
Another point Pearl makes is that Pearl causal models and Pearl counterfactuals are very useful mathematical tools that could be used by ex-statisticians turned experimentalists when they try to understand, and/or make predictions about, nondeterministic systems with potentially hidden variables.
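As a toy illustration of how an intervention can separate the two readings, here is a small Python simulation. Everything in it is an assumption made for the example: a hidden fair bit Z, two observables X and Y that each copy Z with 10% noise, and an "experimenter" who sets X with an independent coin (a crude stand-in for do(X)). It is a sketch of the idea, not Pearl's formalism.

```python
import random

def correlation(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

def sample(n, intervene_on_x, seed=0):
    """Draw n pairs (X, Y) from the toy model, optionally intervening on X."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.getrandbits(1)  # hidden common cause, never recorded
        if intervene_on_x:
            x = rng.getrandbits(1)  # experimenter sets X, ignoring Z
        else:
            x = z if rng.random() < 0.9 else 1 - z  # X copies Z with 10% noise
        y = z if rng.random() < 0.9 else 1 - z      # Y copies Z with 10% noise
        xs.append(x)
        ys.append(y)
    return xs, ys

observational = correlation(*sample(20_000, intervene_on_x=False))
interventional = correlation(*sample(20_000, intervene_on_x=True))
print(observational)   # clearly positive: X and Y both track Z
print(interventional)  # near zero: setting X by hand does not move Y
```

In this toy model the correlation is strong in passive observation but disappears once X is set by hand, which is what you would expect if the dependence runs through a common cause rather than from X to Y.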
This latter point is mostly made by Pearl towards the medical community. But this point also applies to doing AI interpretability research.
When it comes to the more traditional software engineering and physical systems engineering communities, or the experimental physics community for that matter, most people in these communities intuitively understand Pearl’s point about the importance of doing causal intervention based experiments as being plain common sense. They understand this without ever having read the work or the arguments of Pearl first. These communities also use mathematical tools which are equivalent to using Pearl’s do() notation, usually without even knowing about this equivalence.
Nice, yes, I think logical induction might be a way to formalise this, though others would know much more about it.