First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) one represents them using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the “true” network is favored even before we get into interventions.
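For anyone who wants to run this sort of count on other examples, the counting rule is simple to write down. Here is a minimal sketch for discrete networks; the Rain/Sprinkler/Wet variables and cardinalities below are only illustrative placeholders rather than the actual tables from the post, so it shows the counting rule rather than reproducing the 9-vs-8 numbers:

```python
# Free-parameter count for a discrete Bayesian network: each node needs
# (cardinality - 1) free probabilities for every configuration of its parents.
from math import prod

def num_parameters(dag, card):
    """dag: node -> list of parent nodes; card: node -> number of values."""
    return sum(
        (card[node] - 1) * prod(card[p] for p in parents)
        for node, parents in dag.items()
    )

# Hypothetical all-binary example (not the post's actual tables):
card = {"Rain": 2, "Sprinkler": 2, "Wet": 2}
collider = {"Rain": [], "Sprinkler": [], "Wet": ["Rain", "Sprinkler"]}
dense    = {"Rain": [], "Sprinkler": ["Rain"], "Wet": ["Rain", "Sprinkler"]}
print(num_parameters(collider, card))  # 1 + 1 + 4 = 6
print(num_parameters(dense, card))     # 1 + 2 + 4 = 7
```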
Is this always going to be the case? I feel like the answer is “not always”, but I have no empirical data or theoretical argument here.
I’ve tried this before experimentally: code up a Gaussian distribution with a known graph structure, then check how well different graph structures compress the distribution. Modulo equivalent graph structures (e.g. A → B → C vs A ← B ← C vs A ← B → C), the true structure is pretty consistently favored.
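The experiment is short enough to sketch. The version below is a reconstruction of the kind of thing I mean rather than the original script: sample from a linear-Gaussian chain A → B → C and score candidate DAGs with a BIC-style score (log-likelihood minus a parameter penalty) as the compression proxy.

```python
# Reconstructed sketch: generate data from A -> B -> C, then score candidate
# DAGs by BIC (higher is better / shorter description length).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)          # A -> B
C = 0.8 * B + rng.normal(size=n)          # B -> C
data = {"A": A, "B": B, "C": C}

def node_loglik(child, parents):
    """Gaussian log-likelihood of `child` regressed on `parents` (OLS fit),
    plus the number of free parameters used (coefficients + noise variance)."""
    y = data[child]
    X = np.column_stack([data[p] for p in parents]) if parents else np.empty((n, 0))
    X = np.column_stack([np.ones(n), X])  # intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid.var()
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1), X.shape[1] + 1

def bic(dag):
    """dag: dict mapping each node to its list of parents."""
    ll, k = 0.0, 0
    for child, parents in dag.items():
        l, p = node_loglik(child, parents)
        ll += l
        k += p
    return ll - 0.5 * k * np.log(n)

candidates = {
    "A -> B -> C (true)":       {"A": [], "B": ["A"], "C": ["B"]},
    "A <- B -> C (equivalent)":  {"B": [], "A": ["B"], "C": ["B"]},
    "A -> B <- C (collider)":    {"A": [], "C": [], "B": ["A", "C"]},
    "fully connected":           {"A": [], "B": ["A"], "C": ["A", "B"]},
}
for name, dag in candidates.items():
    print(f"{name:26s} BIC = {bic(dag):.1f}")
```

The true chain and its Markov-equivalent A ← B → C should come out essentially tied, the collider should take a large hit because its factorization can't represent the dependence between A and C, and the fully connected graph should pay only a small extra-parameter penalty.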
I don’t think this is much better than just linking up variables whenever they are strongly correlated (at least in ways not already explained by existing links)?
Just doing that does usually work pretty well for (at least a rough estimate of) the undirected graph structure, but then you don’t know the directions of any of the arrows.
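Concretely, in the Gaussian case “correlation not already explained by existing links” can be read off the partial correlations. A toy sketch, purely illustrative (the numbers and variable names are made up):

```python
# Recover the undirected skeleton of a Gaussian chain via partial correlations.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)   # true skeleton: A - B - C
C = 0.8 * B + rng.normal(size=n)
X = np.column_stack([A, B, C])

theta = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
d = np.sqrt(np.diag(theta))
partial_corr = -theta / np.outer(d, d)           # off-diagonal partial correlations
np.fill_diagonal(partial_corr, 1.0)

# A-B and B-C come out clearly nonzero; A-C comes out near zero, so the
# skeleton A - B - C is recovered, but with no arrow directions.
print(np.round(partial_corr, 2))
```

The partial correlation between A and C lands near zero, so only the A-B and B-C edges survive; nothing in this tells you which way those edges point.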
I think this approach only gets the direction of the arrows from two structures, which I’ll call colliders and instrumental variables (because that’s what they are usually called).
Colliders are the case of A → B ← C, which in terms of correlations shows up as A and B being correlated, B and C being correlated, and A and C being independent (marginally; they become dependent once you condition on B). This is a distinct pattern of correlations from the A → B → C or A ← B → C structures, where all three could be correlated, so it is possible for this method to distinguish the structures (well, sometimes not, but that’s tangential to my point).
Instrumental variables are the case of A → B → C, where A → B is known but the direction of B—C is unknown. In that case, the fact that C correlates with A suggests that B → C rather than B ← C (under B ← C, i.e. the collider A → B ← C, A and C would be independent).
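Here is a quick numerical illustration of those two correlation signatures, with standardized linear-Gaussian effects of strength e (my own toy example, not anything from the post):

```python
# Correlation signatures: chain A -> B -> C vs collider A -> B <- C.
import numpy as np

rng = np.random.default_rng(0)
n, e = 100_000, 0.5

# Chain A -> B -> C: corr(A, C) should come out near e * e.
A = rng.normal(size=n)
B = e * A + np.sqrt(1 - e**2) * rng.normal(size=n)
C = e * B + np.sqrt(1 - e**2) * rng.normal(size=n)
print("chain    corr(A, C) ≈", round(np.corrcoef(A, C)[0, 1], 3))   # ≈ 0.25

# Collider A -> B <- C: corr(A, C) should come out near 0.
A = rng.normal(size=n)
C = rng.normal(size=n)
B = e * A + e * C + rng.normal(size=n)
print("collider corr(A, C) ≈", round(np.corrcoef(A, C)[0, 1], 3))   # ≈ 0
```

With e = 0.5 the gap is easy to see; with weaker effects it shrinks to e², which is what the first issue below is about.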
I think the main advantage of larger causal networks is that they give you more opportunities to apply these structures?
But I see two issues with them. First, they don’t seem to work very well in nondeterministic cases. They both rely on the correlation between A and C, and they both need to distinguish whether that correlation is 0 or ab⋅bc (where ab and bc refer to the strengths of the A—B and B—C effects, respectively). If the effects in your causal network are of order e, then you are basically trying to distinguish something of order 0 from something of order e², which is likely going to be hard if e is small. For instance, with e = 0.1 the implied correlation is only 0.01; since the standard error of a sample correlation is roughly 1/√n, you would need on the order of 40,000 observations just to distinguish that from zero at two standard errors. (And the smaller the difference you are trying to detect, the more affected you are going to be by model misspecification, unobserved confounders, measurement error, etc.) This is not a problem in Zack’s case because his effects are near-deterministic, but it would be a problem in other cases. (I have various social science applications in particular in mind.)
Secondly, Zack’s example had the advantage that multiple root causes of wet sidewalks were measured. This gave him a collider to kick off the inference process. (Though I actually suspect this is somewhat unrealistic: wouldn’t you be less likely to turn on the sprinkler if it’s raining? But that relationship would be much less deterministic, so I suspect it’s OK in this case.) Still, this seems very much like a luxury that often isn’t available in the practical cases where I’ve seen people attempt to apply this. (Again, various social science applications.)
Do you know exactly how strongly it favors the true (or equivalent) structure?
There’s an asymmetry between local differences from the true model which can’t match the true distribution (typically too few edges) and differences which can (typically too many edges). As the number of data points n grows, the former get roughly O(n) bits against them per local difference, the latter roughly O(log(n)).
Conceptually, the story for the log(n) scaling is: with n data points, we can typically estimate each parameter to ~log(n) bit precision. So, an extra parameter costs ~log(n) bits.
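Spelled out slightly more (this is just the standard two-part-code / MDL argument, nothing specific to this example): with n samples, a parameter’s estimate is pinned down to width about 1/√n, so writing it down to that precision relative to an O(1) prior range costs about

$$\log_2\!\frac{O(1)}{1/\sqrt{n}} \;=\; \tfrac{1}{2}\log_2 n + O(1)\ \text{bits},$$

i.e. Θ(log n) bits per extra parameter, whereas a structure that cannot reproduce the true distribution pays a fixed KL divergence’s worth of extra bits on every data point, i.e. Θ(n) bits in total.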
I mean, it’s true as a special case of minimum description length epistemology favoring simpler models. Chapter 18 of the Koller and Friedman book has a section about the Bayesian score for model comparison, which has a positive term involving the mutual information between variables and their parents (rewarding “fit”), and a negative term for the number of parameters (penalizing complexity).
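For reference, the asymptotic (BIC) form of that score, as I remember it (M data points, $\hat{P}$ the empirical distribution; the book’s exact notation may differ), is

$$\mathrm{score}_{\mathrm{BIC}}(\mathcal{G} : \mathcal{D}) \;=\; M \sum_i I_{\hat{P}}\!\bigl(X_i ;\, \mathrm{Pa}^{\mathcal{G}}_{X_i}\bigr) \;-\; M \sum_i H_{\hat{P}}(X_i) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[\mathcal{G}],$$

where the entropy term doesn’t depend on the graph, so structure comparison comes down to the mutual-information (“fit”) terms against the Dim[𝒢] (“complexity”) penalty.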
What’s less clear to me (Wentworth likely knows more) is how closely that kind of formal model comparison corresponds to my intuitive sense of causality. The first network in this post intuitively feels very wrong—two backwards arrows and one spurious arrow—out of proportion to its complexity burden of one measly extra parameter. (Maybe the disparity scales up with non-tiny examples, though?)
Yup, that’s right.