My suspicion as to why this took so long to develop is that it’s worthless when looking at graphs with only two nodes:
there, we can only tell the difference between independence and correlation, and there’s no way to tell which way the
causation goes.
Well, actually...
http://jmlr.csail.mit.edu/papers/volume7/shimizu06a/shimizu06a.pdf
http://jmlr.csail.mit.edu/proceedings/papers/v9/peters10a/peters10a.pdf
Fascinating; thanks for the papers! Those look like they describe continuous and discrete distributions; does my statement hold for binary variables?
Aren’t binary variables a discrete distribution?
Yes, but they contain less information. Check out figure 2 of the Peters paper (which describes discrete distributions). If you have an additive noise model, so Y is X plus noise that is independent of X, then by looking at the joint distribution you can distinguish between X causing Y and Y causing X by the corners. This doesn’t seem possible if X and Y can each take only two values (since you get a square, not a trapezoid).
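To make the "corners" point concrete, here is a minimal sketch of my own (it is not code from either paper). It assumes integer-valued variables and exactly known probabilities, and the function names and the toy distribution are made up for illustration. It builds the joint pmf for Y = X + N with X uniform on {0, 1, 2, 3} and N uniform on {0, 1}, then checks each direction by asking whether the conditional pmfs of the effect all have the same shape up to a shift.

```python
from collections import defaultdict
from fractions import Fraction


def joint_additive_noise(px, pn):
    """Joint pmf of (X, Y) where Y = X + N and N is independent of X."""
    joint = defaultdict(Fraction)
    for x, p_x in px.items():
        for n, p_n in pn.items():
            joint[(x, x + n)] += p_x * p_n
    return dict(joint)


def admits_additive_noise_model(joint, cause=0):
    """Check whether effect = f(cause) + noise, with noise independent of cause.

    For integer-valued variables with a known pmf, this holds exactly when
    every conditional pmf of the effect given the cause has the same shape
    once it is shifted so its smallest support point sits at 0.
    """
    effect = 1 - cause
    by_cause = defaultdict(dict)
    for pair, p in joint.items():
        if p > 0:
            by_cause[pair[cause]][pair[effect]] = p

    shapes = set()
    for cond in by_cause.values():
        total = sum(cond.values())
        lo = min(cond)
        shapes.add(frozenset((e - lo, p / total) for e, p in cond.items()))
    return len(shapes) == 1


# Hypothetical toy distribution: X uniform on {0,1,2,3}, noise uniform on {0,1}.
px = {x: Fraction(1, 4) for x in range(4)}
pn = {n: Fraction(1, 2) for n in range(2)}
joint = joint_additive_noise(px, pn)

forward = admits_additive_noise_model(joint, cause=0)   # X -> Y
backward = admits_additive_noise_model(joint, cause=1)  # Y -> X
print(forward, backward)
```

In this toy case the X -> Y direction passes and the Y -> X direction fails, and the failure comes from the corners: given Y = 0 or Y = 4 the value of X is fully determined, while given Y = 1, 2, or 3 it is split between two values, so no shift makes those conditionals match. With real data you would have to estimate the pmf and use a statistical independence test instead of exact equality.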