I’ve tried this before experimentally—i.e. code up a Gaussian distribution with a graph structure, then check how well different graph structures compress samples drawn from it. Modulo equivalent graph structures (e.g. A → B → C vs A ← B ← C vs A ← B → C), the true structure is pretty consistently favored.
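Here's a minimal sketch of that kind of experiment (my reconstruction, not the original code): sample from a linear-Gaussian model with structure A → B → C, then score a few candidate DAGs by description length, using a Gaussian code length plus a BIC-style penalty of 0.5·log2(n) bits per parameter. The 0.8 effect sizes and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)
data = {"A": A, "B": B, "C": C}

def gaussian_bits(y, parents):
    """Bits to encode y under a linear-Gaussian fit on its parents (plus an intercept)."""
    X = np.column_stack([np.ones(len(y))] + [data[p] for p in parents])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    nll_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * resid.var())  # Gaussian code length
    param_bits = 0.5 * np.log2(len(y)) * (X.shape[1] + 1)              # BIC-style penalty
    return nll_bits + param_bits

def dag_bits(dag):
    """Total description length of the data under a DAG given as {node: parents}."""
    return sum(gaussian_bits(data[node], parents) for node, parents in dag.items())

candidates = {
    "A->B->C (true)":   {"A": [], "B": ["A"], "C": ["B"]},
    "A<-B<-C (equiv)":  {"C": [], "B": ["C"], "A": ["B"]},
    "A->B<-C collider": {"A": [], "C": [], "B": ["A", "C"]},
    "no B-C edge":      {"A": [], "B": ["A"], "C": []},
}
for name, dag in candidates.items():
    print(f"{name:18s} {dag_bits(dag):12.0f} bits")
```

With enough samples, the true structure and its Markov-equivalent variant should come out with the lowest bit counts, while the collider and the missing-edge model pay a penalty that grows with n.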
I don’t think this is much better than just linking up variables to each other if they are strongly correlated (at least in ways not explained by existing links)?
Just doing that does usually work pretty well for (at least a rough estimate of) the undirected graph structure, but then you don’t know the directions of any arrows.
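For the Gaussian case, one quick way to see this (a sketch under my own modeling assumptions, not something from the comment above) is via partial correlations read off the inverse covariance matrix: they capture "correlation not explained by the other variables", which recovers the skeleton A – B – C but says nothing about arrow directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)

# Partial correlations can be read off the inverse covariance (precision) matrix.
precision = np.linalg.inv(np.cov(np.column_stack([A, B, C]), rowvar=False))
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 2))
# Expect big A-B and B-C entries and a near-zero A-C entry: the skeleton A - B - C,
# with nothing here telling us which way the arrows point.
```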
I think this approach only gets the direction of the arrows from two structures, which I’ll call colliders and instrumental variables (because that’s what they are usually called).
Colliders are the case of A → B ← C, which in terms of correlations shows up as A and B being correlated, B and C being correlated, and A and C being independent. This is a distinct pattern of correlations from the A → B → C or A ← B → C structures where all three could be correlated, so it is possible for this method to distinguish the structures (well, sometimes not, but that’s tangential to my point).
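A toy check of that correlation signature, with illustrative effect sizes of 0.8 (my numbers, not anything from the discussion):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def corrs(x, y, z):
    """Pairwise correlations corr(x,y), corr(y,z), corr(x,z)."""
    return np.round([np.corrcoef(x, y)[0, 1],
                     np.corrcoef(y, z)[0, 1],
                     np.corrcoef(x, z)[0, 1]], 3)

# Collider A -> B <- C: A and C are independent causes of B.
A, C = rng.normal(size=n), rng.normal(size=n)
B = 0.8 * A + 0.8 * C + rng.normal(size=n)
print("collider:", corrs(A, B, C))  # corr(A,C) ~ 0

# Chain A -> B -> C: all three pairs end up correlated.
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)
print("chain:   ", corrs(A, B, C))  # corr(A,C) clearly nonzero
```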
Instrumental variables are the case of A → B → C, where A → B is known but the direction of B—C is unknown. In that case, the fact that C correlates with A suggests that B → C rather than B ← C.
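And the same kind of toy check for the instrumental-variable pattern, assuming A → B is known and we only need to orient B – C (again a sketch with made-up effect sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
A = rng.normal(size=n)

# Case 1, B -> C: C sits downstream of B, so A's influence leaks through to C.
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)
print("B -> C: corr(A, C) =", round(np.corrcoef(A, C)[0, 1], 3))  # ~ ab*bc, up to normalization

# Case 2, B <- C: C is a separate cause of B, so A and C stay uncorrelated.
C = rng.normal(size=n)
B = 0.8 * A + 0.8 * C + rng.normal(size=n)
print("B <- C: corr(A, C) =", round(np.corrcoef(A, C)[0, 1], 3))  # ~ 0
```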
I think the main advantage larger causal networks give you is that they give you more opportunities to apply these structures?
But I see two issues with them. First, they don’t seem to work very well in nondeterministic cases. They both rely on the correlation between A and C: they both need to distinguish whether that correlation is 0 or ab⋅bc (where ab and bc refer to the effect sizes along A—B and B—C, respectively). If the effects in your causal network are of order e, then you are basically trying to distinguish something of order 0 from something of order e², which is likely going to be hard if e is small. (The smaller the difference you are trying to detect, the more affected you are going to be by model misspecification, unobserved confounders, measurement error, etc.) This is not a problem in Zack’s case because his effects are near-deterministic, but it would be a problem in other cases. (I in particular have various social science applications in mind.)
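As a back-of-the-envelope illustration (numbers are mine and purely illustrative): the sampling noise on an empirical correlation is roughly 1/√n, so separating a correlation of 0 from one of order e² by a couple of standard errors takes on the order of 1/e⁴ samples, before even worrying about confounding or measurement error.

```python
# Standard error of an empirical correlation is ~ 1/sqrt(n), so to see a correlation
# of e^2 stand out from 0 by ~2 standard errors you need n on the order of 4/e^4.
for e in [0.5, 0.3, 0.1]:
    signal = e ** 2
    n_needed = (2 / signal) ** 2
    print(f"effect size e = {e}: A-C correlation ~ {signal:.2f}, need n ~ {n_needed:,.0f}")
```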
Secondly, Zack’s example had the advantage that multiple root causes of wet sidewalks were measured. This gave him a collider to kick off the inference process. (Though I actually suspect this to be unrealistic—wouldn’t you be less likely to turn on the sprinkler if it’s raining? But that relationship would be much less deterministic, so I suspect it’s OK to treat the two causes as independent in this case.) But this seems very much like a luxury that often doesn’t occur in the practical cases where I’ve seen people attempt to apply this. (Again, various social science applications.)
Do you know exactly how strongly it favors the true (or equivalent) structure?
There’s an asymmetry between local differences from the true model which can’t match the true distribution (typically too few edges) and differences which can (typically too many edges). The former get about O(n) bits against them per local difference from the true model, the latter about O(log(n)), as the number of data points n grows.
Conceptually, the story for the log(n) scaling is: with n data points, we can typically estimate each parameter to ~log(n) bits of precision. So, an extra parameter costs ~log(n) bits.
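A quick numerical illustration of the two scalings, using the same kind of BIC-style scoring as in the sketch further up (a penalty of 0.5·log2(n) bits per parameter, which has the same O(log n) behavior as the ~log(n)-bits story; the model and effect sizes are again my own choices, not John's numbers):

```python
import numpy as np

def fit_bits(y, parent_cols, n):
    """Code length (bits) for y under a linear-Gaussian fit: NLL + 0.5*log2(n) per parameter."""
    X = np.column_stack([np.ones(n)] + parent_cols)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    var = (y - X @ w).var()
    return 0.5 * n * np.log2(2 * np.pi * np.e * var) + 0.5 * np.log2(n) * (X.shape[1] + 1)

rng = np.random.default_rng(3)
for n in [1_000, 10_000, 100_000]:
    A = rng.normal(size=n)
    B = 0.8 * A + rng.normal(size=n)
    C = 0.8 * B + rng.normal(size=n)
    base = fit_bits(A, [], n) + fit_bits(B, [A], n)
    true_bits    = base + fit_bits(C, [B], n)      # true structure A -> B -> C
    extra_bits   = base + fit_bits(C, [A, B], n)   # spurious extra edge A -> C
    missing_bits = base + fit_bits(C, [], n)       # missing edge B -> C
    print(f"n={n:>7,}: extra edge costs ~{extra_bits - true_bits:7.1f} bits, "
          f"missing edge costs ~{missing_bits - true_bits:10.1f} bits")
```

The extra-edge penalty should stay small and grow only logarithmically as n increases, while the missing-edge penalty should grow roughly linearly in n.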