First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) one represents them using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the “true” network is favored even before we get into interventions.
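For anyone who wants to run this sort of count on other examples, the counting rule is simple to write down. Here is a minimal sketch for discrete networks; the Rain/Sprinkler/Wet variables and cardinalities below are only illustrative placeholders rather than the actual tables from the post, so it shows the counting rule rather than reproducing the 9-vs-8 numbers:

```python
# Free-parameter count for a discrete Bayesian network: each node needs
# (cardinality - 1) free probabilities for every configuration of its parents.
from math import prod

def num_parameters(dag, card):
    """dag: node -> list of parent nodes; card: node -> number of values."""
    return sum(
        (card[node] - 1) * prod(card[p] for p in parents)
        for node, parents in dag.items()
    )

# Hypothetical all-binary example (not the post's actual tables):
card = {"Rain": 2, "Sprinkler": 2, "Wet": 2}
collider = {"Rain": [], "Sprinkler": [], "Wet": ["Rain", "Sprinkler"]}
dense    = {"Rain": [], "Sprinkler": ["Rain"], "Wet": ["Rain", "Sprinkler"]}
print(num_parameters(collider, card))  # 1 + 1 + 4 = 6
print(num_parameters(dense, card))     # 1 + 2 + 4 = 7
```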
Is this always going to be the case? I feel like the answer is “not always”, but I have no empirical data or theoretical argument here.
I’ve tried this before experimentally: code up a Gaussian distribution with a known graph structure, then check how well different graph structures compress the distribution. Modulo equivalent graph structures (e.g. A → B → C vs A ← B ← C vs A ← B → C), the true structure is pretty consistently favored.
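The experiment is short enough to sketch. The version below is a reconstruction of the kind of thing I mean rather than the original script: sample from a linear-Gaussian chain A → B → C and score candidate DAGs with a BIC-style score (log-likelihood minus a parameter penalty) as the compression proxy.

```python
# Reconstructed sketch: generate data from A -> B -> C, then score candidate
# DAGs by BIC (higher is better / shorter description length).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)          # A -> B
C = 0.8 * B + rng.normal(size=n)          # B -> C
data = {"A": A, "B": B, "C": C}

def node_loglik(child, parents):
    """Gaussian log-likelihood of `child` regressed on `parents` (OLS fit),
    plus the number of free parameters used (coefficients + noise variance)."""
    y = data[child]
    X = np.column_stack([data[p] for p in parents]) if parents else np.empty((n, 0))
    X = np.column_stack([np.ones(n), X])  # intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid.var()
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1), X.shape[1] + 1

def bic(dag):
    """dag: dict mapping each node to its list of parents."""
    ll, k = 0.0, 0
    for child, parents in dag.items():
        l, p = node_loglik(child, parents)
        ll += l
        k += p
    return ll - 0.5 * k * np.log(n)

candidates = {
    "A -> B -> C (true)":       {"A": [], "B": ["A"], "C": ["B"]},
    "A <- B -> C (equivalent)":  {"B": [], "A": ["B"], "C": ["B"]},
    "A -> B <- C (collider)":    {"A": [], "C": [], "B": ["A", "C"]},
    "fully connected":           {"A": [], "B": ["A"], "C": ["A", "B"]},
}
for name, dag in candidates.items():
    print(f"{name:26s} BIC = {bic(dag):.1f}")
```

The true chain and its Markov-equivalent A ← B → C should come out essentially tied, the collider should take a large hit because its factorization can't represent the dependence between A and C, and the fully connected graph should pay only a small extra-parameter penalty.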
I don’t think this is much better than just linking up variables whenever they are strongly correlated (at least in ways not already explained by existing links)?
Just doing that does usually work pretty well for (at least a rough estimate of) the undirected graph structure, but then you don’t know the directions of any of the arrows.
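Concretely, in the Gaussian case “correlation not already explained by existing links” can be read off the partial correlations. A toy sketch, purely illustrative (the numbers and variable names are made up):

```python
# Recover the undirected skeleton of a Gaussian chain via partial correlations.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)   # true skeleton: A - B - C
C = 0.8 * B + rng.normal(size=n)
X = np.column_stack([A, B, C])

theta = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
d = np.sqrt(np.diag(theta))
partial_corr = -theta / np.outer(d, d)           # off-diagonal partial correlations
np.fill_diagonal(partial_corr, 1.0)

# A-B and B-C come out clearly nonzero; A-C comes out near zero, so the
# skeleton A - B - C is recovered, but with no arrow directions.
print(np.round(partial_corr, 2))
```

The partial correlation between A and C lands near zero, so only the A-B and B-C edges survive; nothing in this tells you which way those edges point.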
I think this approach only gets the direction of the arrows from two structures, which I’ll call colliders and instrumental variables (because that’s what they are usually called).
Colliders are the case of A → B ← C, which in terms of correlations shows up as A and B being correlated, B and C being correlated, and A and C being independent (marginally; they become dependent once you condition on B). This is a distinct pattern of correlations from the A → B → C or A ← B → C structures, where all three could be correlated, so it is possible for this method to distinguish the structures (well, sometimes not, but that’s tangential to my point).
Instrumental variables are the case of A → B → C, where A → B is known but the direction of B—C is unknown. In that case, the fact that C correlates with A suggests that B → C rather than B ← C (under B ← C, i.e. the collider A → B ← C, A and C would be independent).
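Here is a quick numerical illustration of those two correlation signatures, with standardized linear-Gaussian effects of strength e (my own toy example, not anything from the post):

```python
# Correlation signatures: chain A -> B -> C vs collider A -> B <- C.
import numpy as np

rng = np.random.default_rng(0)
n, e = 100_000, 0.5

# Chain A -> B -> C: corr(A, C) should come out near e * e.
A = rng.normal(size=n)
B = e * A + np.sqrt(1 - e**2) * rng.normal(size=n)
C = e * B + np.sqrt(1 - e**2) * rng.normal(size=n)
print("chain    corr(A, C) ≈", round(np.corrcoef(A, C)[0, 1], 3))   # ≈ 0.25

# Collider A -> B <- C: corr(A, C) should come out near 0.
A = rng.normal(size=n)
C = rng.normal(size=n)
B = e * A + e * C + rng.normal(size=n)
print("collider corr(A, C) ≈", round(np.corrcoef(A, C)[0, 1], 3))   # ≈ 0
```

With e = 0.5 the gap is easy to see; with weaker effects it shrinks to e², which is what the first issue below is about.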
I think the main advantage of larger causal networks is that they give you more opportunities to apply these structures?
But I see two issues with them. First, they don’t seem to work very well in nondeterministic cases. They both rely on the correlation between A and C, and they both need to distinguish whether that correlation is 0 or ab⋅bc (where ab and bc refer to the strengths of the A—B and B—C effects, respectively). If the effects in your causal network are of order e, then you are basically trying to distinguish something of order 0 from something of order e², which is likely going to be hard if e is small. For instance, with e = 0.1 the implied correlation is only 0.01; since the standard error of a sample correlation is roughly 1/√n, you would need on the order of 40,000 observations just to distinguish that from zero at two standard errors. (And the smaller the difference you are trying to detect, the more affected you are going to be by model misspecification, unobserved confounders, measurement error, etc.) This is not a problem in Zack’s case because his effects are near-deterministic, but it would be a problem in other cases. (I have various social science applications in particular in mind.)
Secondly, Zack’s example had the advantage that multiple root causes of wet sidewalks were measured. This gave him a collider to kick off the inference process. (Though I actually suspect this is somewhat unrealistic: wouldn’t you be less likely to turn on the sprinkler if it’s raining? But that relationship would be much less deterministic, so I suspect it’s OK in this case.) Still, this seems very much like a luxury that often isn’t available in the practical cases where I’ve seen people attempt to apply this. (Again, various social science applications.)
Do you know exactly how strongly it favors the true (or equivalent) structure?
There’s an asymmetry between local differences from the true model which can’t match the true distribution (typically too few edges) and differences which can (typically too many edges). As the number of data points n grows, the former get roughly O(n) bits against them per local difference, the latter roughly O(log(n)).
Conceptually, the story for the log(n) scaling is: with n data points, we can typically estimate each parameter to ~log(n) bit precision. So, an extra parameter costs ~log(n) bits.
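Spelled out slightly more (this is just the standard two-part-code / MDL argument, nothing specific to this example): with n samples, a parameter’s estimate is pinned down to width about 1/√n, so writing it down to that precision relative to an O(1) prior range costs about

$$\log_2\!\frac{O(1)}{1/\sqrt{n}} \;=\; \tfrac{1}{2}\log_2 n + O(1)\ \text{bits},$$

i.e. Θ(log n) bits per extra parameter, whereas a structure that cannot reproduce the true distribution pays a fixed KL divergence’s worth of extra bits on every data point, i.e. Θ(n) bits in total.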
I mean, it’s true as a special case of minimum description length epistemology favoring simpler models. Chapter 18 of the Koller and Friedman book has a section about the Bayesian score for model comparison, which has a positive term involving the mutual information between variables and their parents (rewarding “fit”), and a negative term for the number of parameters (penalizing complexity).
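For reference, the asymptotic (BIC) form of that score, as I remember it (M data points, $\hat{P}$ the empirical distribution; the book’s exact notation may differ), is

$$\mathrm{score}_{\mathrm{BIC}}(\mathcal{G} : \mathcal{D}) \;=\; M \sum_i I_{\hat{P}}\!\bigl(X_i ;\, \mathrm{Pa}^{\mathcal{G}}_{X_i}\bigr) \;-\; M \sum_i H_{\hat{P}}(X_i) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[\mathcal{G}],$$

where the entropy term doesn’t depend on the graph, so structure comparison comes down to the mutual-information (“fit”) terms against the Dim[𝒢] (“complexity”) penalty.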
What’s less clear to me (Wentworth likely knows more) is how closely that kind of formal model comparison corresponds to my intuitive sense of causality. The first network in this post intuitively feels very wrong—two backwards arrows and one spurious arrow—out of proportion to its complexity burden of one measly extra parameter. (Maybe the disparity scales up with non-tiny examples, though?)
Yup, that’s right.