Hi, thanks for this. I agree that this choice was not arbitrary at all!
There are a few related reasons why it was made.
(a) Pearl wisely noted that it is independences that we exploit for things like propagating beliefs around a sparse graph in polynomial time. When he was still arguing for the use of probability in AI, people in the field were not fully on board, because they thought that probabilistic reasoning about n binary variables requires a 2^n table for the joint, which is a non-starter. (Of course, statisticians had been on board with probability for hundreds of years even without computers; their solution was to use clever parametric models. In some sense Bayesian networks are just another kind of clever parametric model that finally penetrated AI culture in the late 80s.)
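To make the parameter-count contrast concrete, here is a toy sketch of my own (the chain-structured network is just an assumed example of a sparse graph, not something from the discussion above):

```python
# Toy illustration: n binary variables.
# Full joint table: 2^n - 1 free probabilities.
# Chain-structured Bayesian network X1 -> X2 -> ... -> Xn, which encodes the
# independences "Xi is independent of X1..X_{i-2} given X_{i-1}":
# only 1 + 2*(n - 1) free probabilities.

def full_joint_params(n: int) -> int:
    return 2 ** n - 1       # one probability per cell, minus the sum-to-one constraint

def chain_params(n: int) -> int:
    return 1 + 2 * (n - 1)  # P(X1=1), plus P(Xi=1 | Xi-1=0) and P(Xi=1 | Xi-1=1) for each i > 1

for n in (10, 20, 30):
    print(n, full_joint_params(n), chain_params(n))
# n=10: 1023 vs 19; n=20: 1,048,575 vs 39; n=30: 1,073,741,823 vs 59
```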
(b) We can define statistical (and causal) models by either independences or dependences, but there is a lack of symmetry here that the symmetry of "presence or absence of edges in a graph" masks. An independence pins down a small part of the parameter space: a model defined by an independence will generally correspond to a lower-dimensional manifold sitting inside the space corresponding to the saturated model (no constraints). A model defined by dependences is just that same space with a "small part" removed. Lowering the dimension of a model is really nice in statistics for a number of reasons.
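A worked two-variable version of this dimension counting (my example, not from the original comment):

```latex
% Two binary variables X, Y: the saturated model is the whole probability
% simplex of joints (p_{00}, p_{01}, p_{10}, p_{11}), which has 3 free parameters.
\[
X \perp Y \;\iff\; p_{11}\,p_{00} - p_{10}\,p_{01} = 0,
\]
% a single equality constraint cutting out a 2-dimensional surface inside the
% simplex, whereas the "dependence" model
\[
p_{11}\,p_{00} - p_{10}\,p_{01} \neq 0
\]
% is the full 3-dimensional simplex with that surface removed.
```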
(c) While conceivably we might be more interested in the presence of a causal effect than in its absence, you are absolutely right that the assumptions which let us equate a causal effect with some functional of observed data generally take the form of equality constraints (e.g. "independences in something"). So it is much more useful to represent those, even if what we care about at the end of the day is the presence of an effect. We can just see how far from null the final effect number is; we don't need a graphical representation for that. However, a graphical representation of the assumptions we are exploiting to get the effect as a functional of observed data is very handy; this is what eventually led Jin Tian to his awesome identification algorithm on graphs.
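As a concrete instance of "effect as a functional of observed data", here is a minimal sketch of back-door adjustment (my own example; the assumption that an observed Z blocks all back-door paths from X to Y is exactly the kind of independence constraint being discussed, and none of the names below come from the original comment):

```python
from collections import Counter

def backdoor_effect(samples, x):
    """Estimate P(Y=1 | do(X=x)) by the adjustment functional
    sum_z P(Y=1 | X=x, Z=z) * P(Z=z), computed from observed (x, y, z) triples.
    This equals the causal quantity only under the assumed back-door independence."""
    n = len(samples)
    z_counts = Counter(z for _, _, z in samples)
    total = 0.0
    for z, nz in z_counts.items():
        ys = [y for xi, y, zi in samples if xi == x and zi == z]
        if ys:                                  # skip strata with no X = x observations
            total += (sum(ys) / len(ys)) * (nz / n)
    return total

# The "presence of an effect" is then just how far this contrast is from null:
# effect = backdoor_effect(data, 1) - backdoor_effect(data, 0)
```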
(d) There is an interesting logical structure to conditional independence, e.g. Phil Dawid's graphoid axioms. There is something like that for dependences (Armstrong's axioms for functional dependence in database theory?), but the structure isn't as rich.
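For reference, a standard statement of these axioms (my summary, not a quote from Dawid), where X, Y, W, Z are disjoint sets of variables:

```latex
% Semi-graphoid axioms for the ternary relation X \perp Y \mid Z:
\begin{align*}
\text{Symmetry:}      \quad & X \perp Y \mid Z \Rightarrow Y \perp X \mid Z \\
\text{Decomposition:} \quad & X \perp (Y, W) \mid Z \Rightarrow X \perp Y \mid Z \\
\text{Weak union:}    \quad & X \perp (Y, W) \mid Z \Rightarrow X \perp Y \mid (Z, W) \\
\text{Contraction:}   \quad & X \perp Y \mid Z \;\wedge\; X \perp W \mid (Z, Y) \Rightarrow X \perp (Y, W) \mid Z \\
\text{Intersection (graphoid; needs positivity):} \quad & X \perp Y \mid (Z, W) \;\wedge\; X \perp W \mid (Z, Y) \Rightarrow X \perp (Y, W) \mid Z
\end{align*}
```

Decomposition, weak union, and contraction are jointly equivalent to the single biconditional X ⊥ (Y, W) | Z ⟺ (X ⊥ Y | Z and X ⊥ W | (Z, Y)), so together with symmetry two axioms suffice.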
edit: there are actually only two semi-graphoid axioms: one for symmetry and one for the chain rule.
edit^2: the graphoid axioms are not complete (because conditional independence is actually kind of a nasty relation). But at least it's a ternary relation. There are far worse dragons in the cave of "equality constraints."