The issue is trying to use an adjacency matrix as a causal influence matrix.

Let’s say you have a graph with the following coefficients:

coefficients = {(0, 1): -0.371, (1, 5): +0.685, (2, 3): +1.139, (2, 4): -0.332, (4, 5): +0.580}

which corresponds to a graph that looks like this

Working through the code step by step, with visualizations:

we see that, indeed, the correlations are what we expect (0 is uncorrelated with 2, 3, or 4 because there is no path from 0 to 2, 3, or 4 through the graph). Note that the diagonal is zeroed out.
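The simulation code itself isn't reproduced here, but a minimal sketch of the kind of setup being described, assuming a linear SEM with independent unit-variance Gaussian noise (the variable names are mine), could look like:

import networkx as nx
import numpy as np

# Build the SEM graph from the coefficients above
sem = nx.DiGraph()
coefficients = {(0, 1): -0.371, (1, 5): +0.685, (2, 3): +1.139, (2, 4): -0.332, (4, 5): +0.580}
sem.add_nodes_from(range(6))
for (src, dst), weight in coefficients.items():
    sem.add_edge(src, dst, weight=weight)

# Sample each node in topological order: weighted sum of its parents plus fresh noise
rng = np.random.default_rng(0)
n = 100_000
data = np.zeros((n, len(sem.nodes)))
for node in nx.topological_sort(sem):
    parent_part = sum(sem.edges[p, node]['weight'] * data[:, p] for p in sem.predecessors(node))
    data[:, node] = parent_part + rng.normal(size=n)

corr = np.corrcoef(data, rowvar=False)
np.fill_diagonal(corr, 0.0)  # zero out the diagonal, as in the plots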
For the next step we are going to look at the log of the correlations, to demonstrate that they are nonzero even in the cases where there is no causal connection between variables:
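Continuing the sketch above, that step amounts to something like:

# Off-diagonal sample correlations are never exactly zero, so their logs are all finite
off_diagonal = ~np.eye(len(sem.nodes), dtype=bool)
log_abs_corr = np.log10(np.abs(corr[off_diagonal]))
print(log_abs_corr.min(), log_abs_corr.max())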
We determine the adjacency matrix and then use that to determine the not_influence pairs, that is, the pairs of nodes where the first node does not affect the second, and we see that, according to not_influence, node 0 has no effect on node 5. That is wrong: node 0 does affect node 5, indirectly through node 1.
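The thread doesn't show this step's code, but it was presumably along these lines (not_influence is the name used in the thread; the rest is my reconstruction):

# Buggy version: treat direct edges (the adjacency matrix) as the causal influence matrix
influence = nx.to_numpy_array(sem, dtype=bool)
not_influence = [(i, j) for i in sem.nodes for j in sem.nodes if i != j and not influence[i, j]]
# (0, 5) lands in not_influence even though 0 influences 5 through 1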
In order to fix this, we have to fix our influence matrix. A node src is causally linked to a node dst if the set of src and all its ancestors intersects with the set of dst and all its ancestors. In English “A is causally linked to B if A causes B, B causes A, or some common thing C causes both A and B”.
import networkx as nx
import numpy as np

# Assumes sem is the nx.DiGraph built from the coefficients above
influence_fixed = np.zeros((len(sem.nodes), len(sem.nodes)), dtype=bool)
for i in list(sem.nodes):
    for j in list(sem.nodes)[i+1:]:
        # Each node together with all of its ancestors
        ancestry_i = nx.ancestors(sem, i).union({i})
        ancestry_j = nx.ancestors(sem, j).union({j})
        # Causally linked iff the ancestries overlap: one node causes the other,
        # or some common ancestor causes both
        if len(ancestry_i.intersection(ancestry_j)) > 0:
            influence_fixed[j, i] = True  # symmetric relation, stored in the lower triangle
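A quick sanity check of what the fixed matrix should contain (the assertions are mine):

assert influence_fixed[5, 0]      # 0 reaches 5 through 1
assert influence_fixed[4, 3]      # 3 and 4 share the common cause 2
assert not influence_fixed[2, 0]  # 0 and 2 have no path and no common ancestor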
Looking again at the graph, this looks right.
Node 0 directly affects node 1 and indirectly affects 5
Node 1 directly affects node 5
Node 2 directly affects 3 and 4 and indirectly affects 5
Node 3 shares a common cause (2) with nodes 4 and 5
Node 4 directly affects node 5
Node 5 is the most-downstream node

and now redrawing the correlations
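Under the fixed criterion, the causally unlinked pairs come out as follows (a hypothetical check against the code above):

unlinked = [(i, j) for i in sem.nodes for j in sem.nodes if j > i and not influence_fixed[j, i]]
print(unlinked)  # [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]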
Yes, thank you. Strong upvote. I noticed this yesterday before going to sleep, let me fix the code and run it again (and update the post).
In a previous version I did use actual causal influence, not just direct nodes. The fixed code then used the transitive closure of the graph instead of the raw adjacency matrix.
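The before/after code isn't reproduced in the thread; presumably the change was something like:

# Fixed: causal influence as reachability, via the transitive closure
influence = nx.to_numpy_array(nx.transitive_closure(sem), dtype=bool)
# instead of only the direct edges:
# influence = nx.to_numpy_array(sem, dtype=bool)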
Interestingly, this makes the number of causal relationships higher, and should therefore also increase the number of causal non-correlations (which is also what now running the code suggests).
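For this particular graph, the count goes from 5 direct edges to 7 reachability pairs:

print(sem.number_of_edges())                         # 5
print(nx.transitive_closure(sem).number_of_edges())  # 7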
I think the transitive closure captures “A is causally upstream of B” and “B is causally upstream of A” but not “Some common thing C is causally upstream of both A and B”. Going back to the example in the above post:
if I do
import networkx as nx

sem = nx.DiGraph()
coefficients = {(0, 1): -0.371, (1, 5): +0.685, (2, 3): +1.139, (2, 4): -0.332, (4, 5): +0.580}
sem.add_nodes_from(range(6))
for (src, dst), weight in coefficients.items():
    sem.add_edge(src, dst, label=f'{weight:+.2f}')
print([(i, j) for i, j in nx.transitive_closure(sem).edges()])

then I get (edge order may vary)

[(0, 1), (0, 5), (1, 5), (2, 3), (2, 4), (2, 5), (4, 5)]
However, we would expect nonzero correlation between the values of e.g. node 3 and node 4, because both 3 and 4 are causally downstream of 2, yet the transitive closure is missing (3, 4) and (3, 5).
There might be a cleaner and more “mathy” way of saying “all pairs of nodes a and b such that the intersection of (a and all a’s ancestors) and (b and all b’s ancestors) is non-empty”, but if there is I don’t know the math term for it. Still, I think that is the construct you need here.
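In symbols, writing An(x) for the ancestors of x, one way to put that construct is

\{\, (a, b) \;:\; (\mathrm{An}(a) \cup \{a\}) \cap (\mathrm{An}(b) \cup \{b\}) \neq \emptyset \,\}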
If some common variable C is causally upstream both of A and B, then I wouldn’t say that A causes B, or B causes A—intervening on A can’t possibly change B, and intervening on B can’t change A (which is the understanding of causation by Pearl).
I agree with this. And yet.
I, however, have an inner computer scientist. And he demands answers. He will not rest until he knows how often ¬Correlation ⇒ ¬Causation, and how often it doesn’t. [...] Let’s take all the correlations between variables which don’t have any causal relationship. The largest of those is the “largest uncaused correlation”. Correlations between two variables which cause each other but are smaller than the largest uncaused correlation are “too small”: There is a causation but it’s not detected.
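In code, the quoted procedure would look something like this (continuing the hypothetical sketch above):

# Detection threshold: the largest correlation among causally unlinked pairs
largest_uncaused = max(abs(corr[i, j]) for (i, j) in unlinked)
# Causally linked pairs whose correlation doesn't clear that bar: causation, undetected
too_small = [(i, j) for i in sem.nodes for j in sem.nodes
             if j > i and influence_fixed[j, i] and abs(corr[i, j]) < largest_uncaused]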
The issue with that is that your “largest uncaused correlation” can be arbitrarily large—if you’ve got some common factor C that explains 99% of the variance in downstream things A and B, but A does not affect B and B does not affect A, your largest uncaused correlation is going to be > 0.9 and as such you’ll think that any correlations less than 0.9 are fake / undetected.
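A quick numerical illustration of that failure mode (the numbers here are mine, not from the post):

import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=100_000)            # common factor C
a = c + 0.1 * rng.normal(size=100_000)  # C explains ~99% of A's variance
b = c + 0.1 * rng.normal(size=100_000)  # C explains ~99% of B's variance
print(np.corrcoef(a, b)[0, 1])          # ~0.99, despite no causal arrow between A and B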
Let’s make the above diagram concrete, and consider the following causal influence graph:
0: Past 24-hour rainfall at SEA-TAC airport
1: Electricity spot price in Seattle (Seattle gets 80% of its electricity from hydro power)
2: Average electric car cost in US, USD
3: Total value of vehicle registration fees collected in California (California charges an amount proportional to the value of the car)
4: Fraction of households with an electric car in Seattle
5: Average household electric bill in Seattle
Changing the total value of vehicle registration fees collected in California will not affect the fraction of households with an electric car in Seattle, nor will changing the fraction of households with an electric car in Seattle affect the total value of vehicle registration fees collected in California. And yet we expect a robust correlation between those two.
Whether or not we can tell that past 24-hour rainfall causes changes in the spot price of electricity should not depend on the relationship between vehicle registration fees in California and electric vehicle ownership in Seattle.