In the example of the two programs, we have to be careful about what we mean by statistical correlation vs. the more standard, colloquial use of the term. I’m assuming that when you say “the same program running on opposite ends of the universe, and their outputs would be the same” you are referring to a deterministic program (otherwise there would be no guarantee that the outputs were the same). But if the outputs of the two programs are deterministic, then there can be no statistical correlation between them. Let A be the outcome of the first program and B the outcome of the second. To measure statistical correlation we have to run the two programs many times, generating i.i.d. samples of A and B, and they are correlated if P(A, B) is not equal to P(A)P(B). But if the two programs are deterministic, say A = a and B = b with probability 1, then they are not statistically correlated, since P(A = a, B = b) = 1 = P(A = a)P(B = b), so the joint distribution factorizes. So to get some correlation the outputs of the programs have to be random. For two randomized algorithms to generate correlated outcomes, they need to share some randomness that they can condition their outputs on, i.e. a common cause. With the two planets example, we run into the same problem again. (PS: by correlation Reichenbach means statistical dependence here rather than e.g. Pearson correlation, but the same argument applies.)
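As a rough illustration of the factorization argument above, here is a minimal Python sketch. The names `program` and `empirical_dependence` are made up for this example, and the constant return value is an arbitrary stand-in for any deterministic output.

```python
from collections import Counter
from itertools import product


def empirical_dependence(samples):
    """Largest |P(a, b) - P(a)P(b)| over observed values, estimated from (a, b) pairs."""
    n = len(samples)
    joint = Counter(samples)
    marg_a = Counter(a for a, _ in samples)
    marg_b = Counter(b for _, b in samples)
    return max(
        abs(joint[(a, b)] / n - (marg_a[a] / n) * (marg_b[b] / n))
        for a, b in product(marg_a, marg_b)
    )


def program():
    # Hypothetical deterministic program: every run returns the same value.
    return 42


# "Run the two programs many times" and collect paired outcomes (A, B).
runs = [(program(), program()) for _ in range(1000)]
gap = empirical_dependence(runs)
print(gap)  # joint and product of marginals are both 1, so the gap is 0.0
```

With deterministic outputs the joint puts all its mass on one pair, so it equals the product of the marginals and no statistical dependence shows up.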
Broadly speaking, the point of the (perhaps confusing) reference in the article is that if we accept the laws of physics, then everything we observe in the universe is ultimately generated by causal dynamics (e.g. the classical equations of motion applied to some initial conditions). We can always describe these causal dynamics and initial conditions using a causal model. So there is always “some causal model” that describes our data.
Thanks for writing that out! I’ve enjoyed thinking this through some more.
I agree that, if you instantiated many copies of the program across the universe as your sampling method, or somehow otherwise “ran them many times”, then their outputs would be independent in the sense that P(A, B) = P(A)P(B). This also holds true if, on each run, there was some “local” error in the program’s otherwise deterministic output.
I had intended to be using the program’s output as a time series of bits, where we are considering the bits to be “sampling” from A and B. Let’s say it’s a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they’re not probabilistically independent, and therefore there is a correlation not due to a causal influence.
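To make the arithmetic above concrete, here is a small Python sketch of the framing in which successive bits are treated as if they were samples. The helpers `arctan_inv` and `pi_bits` are illustrative names; they compute the fractional binary digits of pi with integer arithmetic and Machin's formula, which is just one convenient choice. Using 2000 bits and 32 guard bits is likewise an assumption made for the example.

```python
from collections import Counter

GUARD = 32  # extra low-order bits that absorb integer rounding error


def arctan_inv(x, scale):
    """Integer approximation of scale * arctan(1/x) via the Taylor series."""
    power = scale // x
    total = power
    k, sign = 1, -1
    while power:
        power //= x * x
        k += 2
        total += sign * (power // k)
        sign = -sign
    return total


def pi_bits(n):
    """First n binary digits of pi after the binary point."""
    scale = 1 << (n + GUARD)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5, scale) - arctan_inv(239, scale))
    frac = (pi_scaled >> GUARD) - (3 << n)  # drop the integer part, 3
    return [(frac >> (n - 1 - i)) & 1 for i in range(n)]


n = 2000
bits_a = pi_bits(n)  # output of program 1
bits_b = pi_bits(n)  # output of program 2, the same deterministic program

p_a0 = bits_a.count(0) / n
p_b0 = bits_b.count(0) / n
joint = Counter(zip(bits_a, bits_b))

print("P(A=0) ~", p_a0, " P(B=0) ~", p_b0)
print("P(A=0)P(B=0) ~", p_a0 * p_b0)
print("P(A=0, B=0) ~", joint[(0, 0)] / n)
print("P(A=0, B=1) =", joint[(0, 1)] / n)
```

Under this framing the bit frequencies come out near one half each, while the mismatched pairs (0, 1) and (1, 0) never occur, which is the gap between P(A, B) and P(A)P(B) described above.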
But this is in a Bayesian framing, where the probability isn’t a physical thing about the programs; it’s a thing inside my mind. So, while there is a common source of the correlation (my uncertainty over what the digits of pi are), it’s certainly not a “causal influence” on A and B.
This matters to me because, in the context of agent foundations and AI alignment, I want my probabilities to represent my state of belief (or the agent’s state of belief).
Just to chip in on this: in the case you’re describing, the numbers are not statistically correlated, because they are not random in the statistical sense. They are only random given logical uncertainty.
When considering logical “random” variables, there might well be a common logical “cause” behind any correlation. But I don’t think we know how to properly formalise or talk about that yet. Perhaps one day we can articulate a logical version of Reichenbach’s principle :)
Yeah, I think I agree that the resolution here is something about how we should use these words. In practice I don’t find myself having to distinguish between “statistics” and “probability” and “uncertainty” all that often. But in this case I’d be happy to agree that “all statistical correlations are due to causal influences”, given that we mean “statistical” in a more limited way than I usually think of it.
A group of LessWrong contributors has made a lot of progress on these ideas of logical uncertainty and (what I think they’re now calling) functional decision theory over the last 15ish years, although I don’t really follow it myself, so I’m not sure how close they’d say we are to having it properly formalized.
Nice, yes, I think logical induction might be a way to formalise this, though others would know much more about it.
Thanks for commenting! This is an interesting question, and answering it requires digging into some of the subtleties of causality. Unfortunately, the time series framing you propose doesn’t work, because the time series data is not iid (the variable A = “the next number out of program 1” is not iid), while the distributions P(A), P(B) and P(A,B) you are reasoning with by definition assume iid samples. We really have to have iid here; otherwise we are trying to infer correlation from a single sample. By treating non-iid variables as iid we can see correlations where there are none, but those correlations come from the fact that the next output depends on the previous output, not from the output of one program depending on the output of the other.
We can fix this by imagining a slightly different setup that I think is faithful to your proposal. It is basically the same thing, but instead of computing pi, both programs have in memory a random string of bits, with 0 or 1 occurring with probability 1/2 for each bit. Both programs just read out the string. Let the string of random bits be identical for programs 1 and 2. Now we can describe each output of the programs as iid. If the strings are the same for both programs, the outputs of the programs are perfectly correlated. And you are right: by looking at the output of one of the programs I can update my beliefs about the output of the other program.
Then we need to ask: how do we generate this experiment? To get the string of random bits we have to sample coin flips, then make two copies of the outcome and send one to each program. If we tried to do this with two separate coins at different ends of the universe, we would get different bit strings. So the two programs have a shared source of randomness in their past light cones; this is the common cause.
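A quick simulation of this setup, as a sketch only: the seed, the string length and the helper name `agreement` are arbitrary choices for illustration, and Python's pseudo-random generator stands in for a physical coin.

```python
import random


def agreement(xs, ys):
    """Fraction of positions at which two equal-length bit strings agree."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Common cause: flip one coin n times, then copy the result to both programs.
shared = [rng.randint(0, 1) for _ in range(n)]
prog1, prog2 = list(shared), list(shared)

# No common cause: two separate coins at "different ends of the universe".
coin1 = [rng.randint(0, 1) for _ in range(n)]
coin2 = [rng.randint(0, 1) for _ in range(n)]

shared_agreement = agreement(prog1, prog2)
indep_agreement = agreement(coin1, coin2)

print("shared source:    ", shared_agreement)  # always 1.0
print("independent coins:", indep_agreement)   # close to 0.5
```

The copied strings agree everywhere, so seeing one output tells you the other. The separately flipped strings agree only about half the time, which is what independence predicts.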
I’d agree that the bits of output are not independent in some physical sense. But they’re definitely independent in my mind! If I hear that the 100th binary digit of pi is 1, then my subjective probability over the 101st digit does not update at all, and remains at 0.5/0.5. So this still feels like a frequentism/Bayesianism thing to me.
Re: the modified experiment about random strings, you say that “To get the string of random bits we have to sample a coin flip, and then make two copies of the outcome”. But there’s nothing preventing the universe from simply containing two copies of the same random string, created causally independently. That said, this becomes vanishingly unlikely as the string gets longer.
Yes, I can flip two independent coins a finite number of times and get strings that appear to be correlated. But in the asymptotic limit the probability that they are the same (or correlated at all) goes to zero. Hence, two causally unrelated things can appear dependent for finite sample sizes. But with infinite samples (which is the limit we assume when making statements about probabilities) we get P(a,b) = P(a)P(b).
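For a sense of how fast this vanishes, here is a short Python sketch of the exact probability that two independently flipped fair-coin strings of length n coincide. The function name `prob_identical` is just for illustration, and fair, independent bits are assumed.

```python
from fractions import Fraction


def prob_identical(n):
    """Exact probability that two independent fair n-bit strings are identical."""
    # Each position matches with probability 1/2, independently of the others.
    return Fraction(1, 2) ** n


for n in (1, 10, 100):
    print(n, float(prob_identical(n)))
```

The probability is 2^-n, so it is 1/1024 at ten bits and already below 10^-30 at a hundred bits.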