But this same move can alternatively be done with the Frankenstein rule, right? (I might be missing something.) But Frankenstein has no such additional requirement, as stated. If I’m not missing something, I think Frankenstein might be invalid as stated (like maybe it needs an analogous extra condition). Haven’t thought this through yet.
i.e. I think either:
- I’m missing something,
- Frankenstein is invalid, or
- you don’t need M←X→N.
Yeah, thinking slightly aloud, I tentatively think Frankenstein needs an extra condition like the blanket stitch condition… something which enforces the choice of topo ordering to be within the right class of topo orderings? That’s what the chain M←X→N does—it means we can assign orderings M,X¯i,Xi,N or M,Xi,X¯i,N, but not e.g. M,X¯i,N,Xi, even though that order is consistent with both of the other original graphs.
If I get some time I’ll return to this and think harder but I can’t guarantee it.
ETA I did spend a bit more time, and the below mostly resolves it: I was indeed missing something, and Frankenstein indeed needs an extra condition, but you do need M←X→N.
I had another look at this with a fresh brain and it was clearer what was happening.
TL;DR: It was both of ‘I’m missing something’, and a little bit ‘Frankenstein is invalid’ (it needs an extra condition which is sort of implicit in the post). As I guessed, with a little extra bookkeeping, we don’t need Stitching for the end-to-end proof. I’m also fairly confident Frankenstein subsumes Stitching in the general case. A ‘deductive system’ lens makes this all clearer (for me).
My Frankenstein mistake
The key invalid move I was making when I said ‘But this same move can alternatively be done with the Frankenstein rule, right?’ is that Frankenstein requires all graphs to be over the same set of variables. This is kind of implicit in the post, but I don’t see it spelled out. I was applying it to an (M,X¯i,Xi) graph (N absent) and an (X¯i,Xi,N) graph (M absent). No can do!
Skipping Stitch in the end-to-end proof
I was right, though: Frankenstein can be applied. But we first have to do ‘Node Introduction’ and ‘Expansion’ on the graphs to make them compatible (these extra bookkeeping rules are detailed further below).
So, to get myself in a position to apply Frankenstein on those graphs, I have to first (1) introduce M to the second graph (with an arrow from each of N, Xi, and X¯i), and (2) expand the ‘blanket’ graph M←X→N (choosing X¯i→Xi to maintain topological consistency). Then (3) we Frankenstein them, which leaves N dangling, as we wanted.
Next, (4) I have to introduce N to the first graph (again with an arrow from each of M, Xi, and X¯i). I also have a topological ordering issue with the first Frankenstein, so (5) I reorder M to the top by bookkeeping. Now (6) I can Frankenstein those, to sever the X¯i→Xi arrow as hoped.
But now we’ve performed exactly the combo that Stitch was performing in the original proof. The rest of the proof proceeds as before (and we don’t need Stitch).
More bookkeeping rules
These are both useful for ‘expansive’ moves which grow the set of variables from some smaller seed. The original post mentions ‘arrow introduction’ but nothing explicitly about nodes. I got these by thinking about these graphs as a kind of ‘deductive system’.
Node introduction
A graph without all variables is making a claim about the distribution with those other variables marginalised out.
We can always introduce new variables—but we can’t (by default) assume anything about their independences. It’s sound (safe) to assume they’re dependent on everything else—i.e. they receive an incoming arrow from everywhere. If we know more than that (regarding dependencies), it’s expressed as absence of one or another arrow.
e.g. a graph with A,B,C is making a claim about P(A,B,C). If there’s also a D, we haven’t learned anything about its independences. But we can introduce it, as long as it has arrows A→D, B→D, and C→D.
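As a sketch of this bookkeeping rule (the function name and the parent-set representation are my own illustrative choices, not from the post), a DAG can be represented as a map from each node to its set of parents, and Node Introduction simply gives the new node every existing node as a parent:

```python
# Illustrative sketch of Node Introduction: a DAG is a dict mapping each
# node to its set of parents. A newly introduced node receives an incoming
# arrow from every existing node, so we assume nothing about its independences.

def introduce_node(dag, new):
    """Return a copy of `dag` with `new` added as a child of every existing node."""
    assert new not in dag
    extended = {v: set(ps) for v, ps in dag.items()}  # don't mutate the input
    extended[new] = set(dag)  # arrows A->new, B->new, ... from all old nodes
    return extended

# The A,B,C example from the text: introduce D with arrows A->D, B->D, C->D.
g = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
g2 = introduce_node(g, "D")
print(sorted(g2["D"]))  # ['A', 'B', 'C']
```

Note the new node lands at the bottom of every topological ordering, which matters for the ‘zip from the bottom’ step later.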
Node expansion aka un-combine
A graph with combined nodes is making a claim about the distribution as expressed with those variables appearing jointly. There’s nothing expressed about their internal relationship.
We can always expand them out—but we can’t (by default) assume anything about their independences. It’s sound to expand and spell them out in any fully-connected sub-DAG—i.e. they have to be internally fully dependent. We also have to connect every incoming and outgoing edge to every expanded node, i.e. if there’s a dependency between the combination and something else, there’s a dependency between each expanded node and that same thing.
e.g. a graph with X,Y,Z is making a claim about P(X,Y,Z). If Y is actually several variables, we can expand them out, as long as we respect all possible interactions that the original graph might have expressed.
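Node Expansion can be sketched in the same hypothetical parent-set representation (again, my own names, not the post’s): choose an internal order for the parts, fully connect them, and rewire every external arrow to touch every part.

```python
# Illustrative sketch of Node Expansion (un-combine): replace a combined node
# with its parts in some chosen internal order, fully connecting them
# (part_j gets arrows from part_0..part_{j-1}), and rewiring every external
# arrow to touch every part. DAG = dict node -> set of parents.

def expand_node(dag, combined, parts):
    """Expand `combined` into `parts` (listed in the chosen internal order)."""
    out = {}
    for v, ps in dag.items():
        if v == combined:
            continue
        # any arrow combined->v becomes arrows part->v for every part
        out[v] = (set(ps) - {combined}) | (set(parts) if combined in ps else set())
    external_parents = set(dag[combined])            # arrows into the combination
    for j, p in enumerate(parts):
        out[p] = external_parents | set(parts[:j])   # external parents + internal completeness
    return out

# Expand Y in X -> Y -> Z into (Y1, Y2):
g = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
g2 = expand_node(g, "Y", ["Y1", "Y2"])
print(sorted(g2["Z"]))   # ['Y1', 'Y2']: both parts inherit the outgoing arrow
print(sorted(g2["Y2"]))  # ['X', 'Y1']: external parent plus internal completeness
```

The choice of internal order is exactly the freedom used above when expanding the blanket graph ‘choosing X¯i→Xi to maintain topological consistency’.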
Deductive system
I think what we have on our hands is a ‘deductive system’ or maybe, grandiosely, a ‘logic’. The semantics is actual distributions and divergences. The syntax is graphs (with divergence annotations).
An atomic proposition is a graph G together with a divergence annotation ϵ, which we can write G(ϵ).
Semantically, that’s when the ‘true distribution satisfies G up to ϵ KL divergence’ as you described[1]. Crucially, some variables might not be in the graph. In that case, the distributions in the relevant divergence expression are marginalised over the missing variables. This means that the semantics is always under-determined, because we can always introduce new variables (which are allowed to depend on other variables however they like, being unconstrained by the graph).
Then we’re interested in sound deductive rules like
G(ϵ_G) ⟹ H(ϵ_H)
Syntactically that is ‘when we have deduced G(ϵ_G) we can deduce H(ϵ_H)’. That’s sound if, for any distribution P satisfying G(ϵ_G), we also have P satisfying H(ϵ_H).
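To make the semantics concrete, here is a minimal check of the satisfaction condition from the footnote, for a small discrete distribution. The representation (assignment tuples mapped to probabilities, parents given by index) and the function name are my own illustrative choices:

```python
# Check G(eps) semantically: P satisfies G(eps) when
# D_KL( P(X) || prod_i P(X_i | X_pa(i)) ) <= eps.
# P: dict from binary assignment tuples to probabilities.
# parents: dict from variable index to list of parent indices.

import itertools, math

def kl_to_factorization(P, parents):
    """KL divergence from P to its factorization over the given graph."""
    n = len(next(iter(P)))
    def marginal(idxs, assignment):
        # probability that the variables at `idxs` take the values in `assignment`
        return sum(p for a, p in P.items() if all(a[i] == assignment[i] for i in idxs))
    kl = 0.0
    for a, p in P.items():
        if p == 0:
            continue
        q = 1.0
        for i in range(n):
            num = marginal([i] + parents[i], a)                   # P(X_i, X_pa)
            den = marginal(parents[i], a) if parents[i] else 1.0  # P(X_pa)
            q *= num / den                                        # P(X_i | X_pa)
        kl += p * math.log(p / q)
    return kl

# A joint that genuinely factorizes as X0 -> X1, with X2 independent:
P = {}
for x0, x1, x2 in itertools.product([0, 1], repeat=3):
    p0 = 0.3 if x0 else 0.7
    p1 = (0.9 if x1 else 0.1) if x0 else (0.2 if x1 else 0.8)
    P[(x0, x1, x2)] = p0 * p1 * 0.5
parents = {0: [], 1: [0], 2: []}
print(kl_to_factorization(P, parents))  # ≈ 0: P satisfies G(ϵ) for any ϵ ≥ 0
```

Dropping the X0→X1 arrow makes the divergence strictly positive (it becomes the mutual information between X0 and X1), which is the sense in which arrow removal is a non-trivial deduction.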
Gesture at general Frankenstitch rule
More generally, I’m reasonably sure Stitch is secretly just multiple applications of Frankenstein, as in the example above. The tricky bit I haven’t strictly worked through is when there’s interleaving of variables on either side of the blanket in the overall topological ordering.
A rough HANDWAVE proof sketch, similar in structure to the example above:
- Expand the X←Y→Z blanket graph
  - The arrows internal to X, Y, and Z need to be complete
  - We can always choose a complete graph consistent with the X, Y, and Z parts of the original graphs (otherwise there wouldn’t be an overall consistent topology)
  - Notice that the connections X←Y are all Y to all X, which is not necessarily consistent with the original XY graph
    - and similarly for the YZ arrows
    - (there could be X→Y arrows in the original)
- Introduce Z to the XY graph (and vice versa)
  - The newly-introduced nodes are necessarily ‘at the bottom’ (with arrows from everything else)
  - We can always choose internal connections for the introduced Zs consistent with the original YZ graph
  - Notice that the connections X→Z and Y→Z in the augmented XY graph all keep Z at the bottom, which is not necessarily consistent with the original YZ graph (and vice versa)
    - But this is consistent with the Expanded blanket graph
- We ‘zip from the bottom’ with successive bookkeeping and Frankensteins
  - THIS IS WHERE THE HANDWAVE HAPPENS
  - Just like in the example above, where we got the N sorted out and then moved the introduced M to the ‘top’ in preparation to Frankenstein in the M graph, I think there should always be enough connections between the introduced nodes to ‘move them up’ as needed for the stitch to proceed
I’m not likely to bother proving this strictly, since Stitch is independently valid (though it’d be nice to have a more parsimonious arsenal of ‘basic moves’). I’m sharing this mainly because I think Expansion and Node Introduction are of independent relevance.
[1] More formally, G(ϵ) over variables X is satisfied by distribution P when D_KL(P(X) ‖ ∏_i P(X_i | X_{pa_G(i)})) ≤ ϵ. (This assumes some assignment of variables in G to variables in P.)
Perhaps more importantly, I think with Node Introduction we really don’t need M←X→N after all?
With Node Introduction and some bookkeeping, we can get the N and M graphs topologically compatible, and Frankenstein them. We can’t get as neat a merge as if we also had M←X→N: in particular, we can’t get rid of the arrow M→N. But that’s fine, we were about to draw that arrow in anyway for the next step!
Is something invalid here? Flagging confusion. This is a slightly more substantial claim than the original proof makes, since it assumes strictly less. Downstream, I think it makes the Resample unnecessary.
ETA: it’s cleared up below—there’s an invalid Reorder here (it removes a v-structure).
I think one of the reorder steps on the bottom line is invalid.
Stock counterexample:
- Let Xi and X¯i each be strings of bits. Their zeroth bits are equal; the rest are independent.
- M is another bitstring with zeroth bit equal to the zeroth bit of Xi, and all other bits independent of X.
- N is a bitstring with zeroth bit equal to the zeroth bit of Xi, and with the rest of the bits given by:
  - odd-numbered bits are the odd-numbered independent bits of M xor independent bits of Xi
  - even-numbered bits are the even-numbered independent bits of M xor independent bits of X¯i
Then the starting diagrams in your proof are all satisfied, but the final diagram is not.
The step which breaks for this counterexample is the first reorder step on the bottom line.
Mhm, OK I think I see. But X¯i, N, M appear to me to form a complete subgraph, and all I did was redirect the N→M arrow. I confess I am mildly confused by the ‘reorder complete subgraph’ bookkeeping rule. It should apply to the A→B in A→B←C, right? But then I’d be able to deduce A←B←C, which is strictly different. So it must mean something other than what I’m taking it to mean.
Maybe need to go back and stare at the semantics for a bit. (But this syntactic view with motifs and transformations is much nicer!)
Oh you’re right, the rule about reordering complete subgraphs is missing some constraints. That’s a mistake on my part.
Thinking about it for ~60 seconds: when reordering a complete subgraph, arrows going out of the subgraph stay the same, but if there are arrows going in then we may need to introduce new arrows between (former) spouses.
Aha. Preserving v-structures (colliders like X→Y←Z) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can’t do it (cost-free) if it introduces or removes any v-structures.
Plausibly if we’re willing to weaken by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs—but they’d be non-invertible. Haven’t thought about that.
[1] Verma & Pearl, ‘Equivalence and Synthesis of Causal Models’, 1990.
Either that’s wrong or I’m misinterpreting it, because a fully-connected DAG should be equivalent to any other fully-connected DAG but they all have different v-structures.
Mmm, I misinterpreted at first. It’s only a v-structure if X and Z are not connected. So this is a property which needs to be maintained effectively ‘at the boundary’ of the fully-connected cluster which we’re rewriting. I think that tallies with everything else, right?
ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would clarify whether my handwave proof of Frankenstein ⟹ Stitch is right or not.
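For concreteness, here is a small sketch (representation and function name are my own) that enumerates v-structures under the corrected definition, illustrating both points above: A→B←C has a v-structure while the fully-connected version does not, which is why all fully-connected DAGs are Markov-equivalent.

```python
# Enumerate v-structures (colliders X->Y<-Z with X and Z NON-adjacent)
# of a DAG given as node -> set of parents.

from itertools import combinations

def v_structures(dag):
    vs = set()
    for y, ps in dag.items():
        for x, z in combinations(sorted(ps), 2):
            adjacent = x in dag[z] or z in dag[x]
            if not adjacent:  # only a collider with non-adjacent parents counts
                vs.add((x, y, z))
    return vs

# A->B<-C with A, C unconnected: one v-structure, so reordering to A<-B<-C
# (which has none) is not a cost-free bookkeeping move.
print(v_structures({"A": set(), "C": set(), "B": {"A", "C"}}))  # {('A', 'B', 'C')}
# Fully connect A and C: the collider's parents are now adjacent, no v-structure.
print(v_structures({"A": set(), "C": {"A"}, "B": {"A", "C"}})) # set()
```

This matches the ‘at the boundary’ reading: rewriting inside a fully-connected cluster can only create or destroy v-structures via the cluster’s non-adjacent outside neighbours.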
Unironically, I think it’s worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2
That’s concerning. It would appear to make both our proofs invalid.
But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.