Very interesting.
I am still confused about how you can tell the causal direction from just the raw data. Taking your toy example, I could write code that first randomly determines Y and then decides on X, in such a way that the probability table you give is produced. Presumably you mean something more specific by causal direction than I am imagining?
I got interested and wrote a Python example. I get (up to some noise) the same distribution, and at least to my layperson's way of looking at it, I would say that Y has a causal effect on X.
import random
rand = random.random
def get_xy():
    # Determine Y first, make it causally affect X
    if rand() > 82/100:
        Y = 1
    else:
        Y = 0
    if Y == 1:
        if rand() > 1/2:
            X = 1
        else:
            X = 0
    else:
        if rand() > 1/82:
            X = 1  # When Y=0 it is very likely X=1.
        else:
            X = 0
    return X, Y

dat = [0, 0, 0, 0]
for _ in range(1000000):
    X, Y = get_xy()
    dat[X + 2*Y] += 1
print(dat)
>> [10102, 809880, 90204, 89814]
This is a very good point, and you are right: Y causes X here, and we still get the stated distribution. The reason we rule this case out is that the set of probability distributions in which Y causes X is a null set, i.e. it has measure zero.
If we assume the graph Y->X and generate the data by choosing Y first and then X, as you did in your code, then it depends on the exact values of P(X|Y) whether X⊥Z holds. If the values of P(X|Y) change just slightly, then X⊥Z won't hold anymore. So given that our graph is Y->X, it is really unlikely (it will almost never happen) that we get a distribution in which X⊥Z holds. But since we did observe such a distribution with X⊥Z, we infer that the graph is not Y->X.
In contrast, if we assume the graph X -> Y <- Z, then X⊥Z will hold for any distribution that is compatible with our graph, no matter what the exact values of P(Y|X,Z) are (as long as they correspond to the graph, which includes satisfying X⊥Z).
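To make this concrete, here is a minimal sketch (not from the original post, and it assumes a detail that isn't quoted in this thread: that Z is the XOR of the other two variables, i.e. Z := X XOR Y in Ben's "Y first" story, which is at least consistent with the later comments about Y being deterministic given X and Z). It draws the three parameters of the Y->X story, P(Y=1), P(X=1|Y=0) and P(X=1|Y=1), uniformly at random and checks how close X and Z come to being independent. Random draws essentially never hit independence, while the fine-tuned values from Ben's code hit it exactly.

import random

def dependence(p_y1, p_x1_given_y0, p_x1_given_y1):
    # |P(X=1, Z=1) - P(X=1) P(Z=1)| under the Y -> X story, with Z := X XOR Y (assumed).
    # Joint over (X, Y) implied by choosing Y first, then X given Y.
    p = {
        (0, 0): (1 - p_y1) * (1 - p_x1_given_y0),
        (1, 0): (1 - p_y1) * p_x1_given_y0,
        (0, 1): p_y1 * (1 - p_x1_given_y1),
        (1, 1): p_y1 * p_x1_given_y1,
    }
    p_x1 = p[(1, 0)] + p[(1, 1)]
    p_z1 = p[(1, 0)] + p[(0, 1)]   # Z = 1 exactly when X != Y
    p_x1_z1 = p[(1, 0)]            # X=1 and Z=1 happens exactly when X=1, Y=0
    return abs(p_x1_z1 - p_x1 * p_z1)

# Random parametrizations of the Y -> X graph: X⊥Z is essentially never satisfied.
gaps = [dependence(random.random(), random.random(), random.random())
        for _ in range(100000)]
print(min(gaps))   # > 0: none of the random draws hit independence exactly

# The fine-tuned values from Ben's code: P(Y=1)=0.18, P(X=1|Y=0)=81/82, P(X=1|Y=1)=1/2.
print(dependence(0.18, 81/82, 0.5))   # ~0 (exact, up to floating point)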
Thank you very much. That explains a lot. To repeat in my own words, for my understanding: in my example, perturbing any of the probabilities, even slightly, would upset the independence of Z and X. So in some sense their independence is a fine-tuned coincidence, engineered by the choice of values. The model assumes that when independencies are observed, they are not coincidences in this way but arise from the causal structure itself. And this assumption leads to the conclusion that X comes before Y.
Yes, exactly!
I’m not quite convinced by this response. Would it be possible to formalize “set of probability distributions in which Y causes X is a null set, i.e. it has measure zero”?
It is true that if the graph were (Y->X, X->Z, Y->Z), then we would violate faithfulness. There are results that show that under some assumptions, faithfulness is only violated with probability 0. But those assumptions do not seem to hold in this example.
We are looking at the space of conditional probability table (CPT) parametrizations in which the independencies of our given joint probability distribution (JPD) hold.
If Y causes X, the independencies of our JPD only hold for a specific combination of conditional probabilities, namely those in which P(X,Z) = P(X)P(Z). The set of CPT parametrizations with P(X,Z) = P(X)P(Z) has measure zero (it is lower-dimensional than the space of all CPT parametrizations).
In contrast, any CPT parametrization of graph 1 would fulfill the independencies of our given JPD. So the set of CPT parametrizations in which our independencies hold has a non-zero measure.
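To illustrate the dimension counting (again under the assumed construction Z := X XOR Y, which is not quoted in this thread): the Y->X parametrization has three free parameters, and P(X,Z) = P(X)P(Z) is a single polynomial equation in them, so the parametrizations satisfying it form a lower-dimensional surface. A small exact check with fractions, showing that Ben's values solve the equation and that perturbing any single parameter breaks it:

from fractions import Fraction as F

def independence_gap(p_y1, p_x1_y0, p_x1_y1):
    # P(X=1, Z=1) - P(X=1) P(Z=1) under Y -> X, with Z := X XOR Y (exact arithmetic).
    p_x1_z1 = (1 - p_y1) * p_x1_y0                        # X=1, Z=1  <=>  X=1, Y=0
    p_x1 = (1 - p_y1) * p_x1_y0 + p_y1 * p_x1_y1
    p_z1 = (1 - p_y1) * p_x1_y0 + p_y1 * (1 - p_x1_y1)    # Z=1  <=>  X != Y
    return p_x1_z1 - p_x1 * p_z1

# Ben's parameters satisfy the constraint exactly ...
print(independence_gap(F(18, 100), F(81, 82), F(1, 2)))           # 0

# ... but perturbing any single parameter, however slightly, breaks it.
eps = F(1, 10**9)
print(independence_gap(F(18, 100), F(81, 82) + eps, F(1, 2)))     # != 0
print(independence_gap(F(18, 100) + eps, F(81, 82), F(1, 2)))     # != 0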
Could you say what these assumptions are that don’t hold here?
I’m pretty sure that you can prove that finite factored sets have this property directly, actually!
For example, Theorem 3.2 in Causation, Prediction, and Search says that faithfulness holds with probability 1 if we have a linear model with coefficients drawn randomly from distributions with positive densities.
It is not clear to me why we should expect faithfulness to hold in a situation like this, where Z is constructed from other variables with a particular purpose in mind.
Consider the graph Y<-X->Z. If I set Y:=X and Z:=X, we have that X⊥Y|Z, violating faithfulness. How are you sure that you don’t violate faithfulness by constructing Z?
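A quick numerical check of this example (a sketch, not from any cited source): with Y := X and Z := X, X and Y are maximally dependent unconditionally, but conditioning on Z pins both of them down, so X⊥Y|Z holds, an independence that the graph Y <- X -> Z does not imply.

import math
import random
from collections import Counter

def mutual_information(pairs):
    # Empirical mutual information (in bits) between the two coordinates of `pairs`.
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

data = []
for _ in range(100000):
    X = int(random.random() < 0.5)
    Y, Z = X, X                          # Y := X, Z := X
    data.append((X, Y, Z))

print(mutual_information([(x, y) for x, y, _ in data]))    # ~1 bit: X and Y are dependent
for z in (0, 1):
    stratum = [(x, y) for x, y, zz in data if zz == z]
    print(z, mutual_information(stratum))                   # 0.0: independent given Z = z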
I would say the graph then reduces to the graph with just one node, namely X. And faithfulness is not violated, because we wouldn't expect X⊥X|X to hold.
In contrast, the graph X -> Y <- Z does not reduce straightforwardly even though Y is deterministic given X and Z, because there are no two variables which are information equivalent.
I'm not completely sure, though, whether it reduces in a different way, because Y and {X, Z} are information equivalent (just like X and {Y, Z}, as well as Z and {X, Y}). And I agree that the conditions for Theorem 3.2 aren't fulfilled in this case. Intuitively I'd still say X -> Y <- Z doesn't reduce, but I'd probably need to go through this paper more closely in order to be sure.
Finally got around to looking at this. I didn’t read the paper carefully, so I may have missed something, but I could not find anything that makes me more at ease with this conclusion.
Ben has already shown that it is perfectly possible that Y causes X. If this is somehow less likely than that X causes Y, that is exactly what needs to be made precise. If faithfulness is the assumption that makes this work, then we need to show that faithfulness is a reasonable assumption in this example. It seems that this work has not been done?
If we can find the precise and reasonable assumptions that exclude that Y causes X, that would be super interesting.