Can’t find the “Swap rule” in the post, what was that?
The more complex approximation rule requires some additional machinery. For any diagram j, we can decompose its DKL into one term for each variable via the chain rule:

DKL(P[X] ∥ ∏i P[Xi|Xpaj(i)]) = ∑i E[DKL(P[Xi|X<i] ∥ P[Xi|Xpaj(i)])]
If we know how our upper bound ϵ on the diagram’s DKL decomposes across variables, then we can use that for more fine-grained bounds. Suppose that, for each diagram j and variable i, we have an upper bound
ϵij ≥ E[DKL(P[Xi|X<i] ∥ P[Xi|Xpaj(i)])]
I guess here there are missing expectations (indicated in red), otherwise X<i would remain a free RV. The second expectation is not necessary, although omitting it makes the bound unnecessarily strict.
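To sanity-check the decomposition numerically, here is a minimal sketch (the three-binary-variable setup, the choice of diagram X1 → X2 → X3, and all names are mine, not from the post): the total DKL between the true joint and the diagram's factorization equals the sum of the per-variable expected conditional DKLs, and here only the X3 term is nonzero because X1 and X2 keep their full parent sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint distribution P over three binary variables X1, X2, X3,
# stored as a 2x2x2 array of probabilities.
P = rng.random((2, 2, 2))
P /= P.sum()

def marg(p, axes):
    """Marginal of p over the given axes (kept), summing out the rest."""
    other = tuple(i for i in range(p.ndim) if i not in axes)
    return p.sum(axis=other)

# Approximating diagram: X1 -> X2 -> X3 (X3's parent set drops X1).
# Q[x1,x2,x3] = P(x1) P(x2|x1) P(x3|x2)
P1 = marg(P, (0,))      # P(x1)
P12 = marg(P, (0, 1))   # P(x1,x2)
P23 = marg(P, (1, 2))   # P(x2,x3)
P2 = marg(P, (1,))      # P(x2)
Q = (P1[:, None, None]
     * (P12 / P1[:, None])[:, :, None]
     * (P23 / P2[:, None])[None, :, :])

# Total KL divergence DKL(P || Q).
total = np.sum(P * np.log(P / Q))

# Per-variable terms of the chain-rule decomposition:
#   X1: DKL(P(x1) || P(x1)) = 0           (same parent set)
#   X2: E[DKL(P(x2|x1) || P(x2|x1))] = 0  (same parent set)
#   X3: E_{x1,x2}[DKL(P(x3|x1,x2) || P(x3|x2))]
cond_full = P / P12[:, :, None]              # P(x3|x1,x2)
cond_drop = (P23 / P2[:, None])[None, :, :]  # P(x3|x2)
term3 = np.sum(P * np.log(cond_full / cond_drop))

assert np.isclose(total, term3)
```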
I have never encountered this topic, and I couldn’t find it skimming the Handbook of Graphical Models (2019), did you invent it? If not, could you list references?
Do you perchance have something for conditioning or marginalization? I know that graphical marginalization requires using ADMGs (graphs with double-headed arrows) instead of DAGs, I don’t know about conditioning.
Exercise
Extend the above proofs to re-rooting of arbitrary trees (i.e. the diagram is a tree). We recommend thinking about your notation first; better choices for notation make working with trees much easier.
In a tree each node has a single parent. Re-rooting means flipping the arrow from the root to one of its children.
In the factorization I do P(old_root)P(new_root|old_root) = P(old_root|new_root)P(new_root), so the proof is the same.
The fact that notation is not involved makes me suspect I may be missing something. Maybe they mean you need to write out the factorization expression in full? I guess the nice notation is using ch(X_i) instead of pa(X_i).
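A quick numeric sketch of the flip (the three-node chain A → B → C and all names are my choice, not from the post): build a joint that factors over the chain, re-root at B by Bayes-inverting only the A → B edge, and check the joint is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def norm(x, axis):
    """Normalize x along the given axis so slices sum to 1."""
    return x / x.sum(axis=axis, keepdims=True)

# Build a joint that factors over the tree A -> B -> C.
pa = norm(rng.random(3), 0)          # P(a)
pb_a = norm(rng.random((3, 3)), 1)   # P(b|a), rows indexed by a
pc_b = norm(rng.random((3, 3)), 1)   # P(c|b), rows indexed by b
P = pa[:, None, None] * pb_a[:, :, None] * pc_b[None, :, :]

# Re-root at B: flip only the A -> B edge, giving B -> A and B -> C.
pb = P.sum(axis=(0, 2))                    # P(b)
pa_b = (pa[:, None] * pb_a) / pb[None, :]  # P(a|b) via Bayes
Q = pb[None, :, None] * pa_b[:, :, None] * pc_b[None, :, :]

# Same joint, so the proof carries over unchanged.
assert np.allclose(P, Q)
```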
Joint Independence Rule: Exercise
The Joint Independence Rule can be proven using the Frankenstein Rule. This is left as an exercise. (And we mean that unironically, it is actually a good simple exercise which will highlight one or two subtle points, not a long slog of tedium as the phrase “left as an exercise” often indicates.)
Bonus exercise: also prove the conditional version of the Joint Independence Rule using the Frankenstein Rule.
Joint independence starts from the graphs
X_i    {X_not i}    (two nodes, no edge between them)
for all i. The Frankenstein rule cannot be applied right away because the graphs are not over the same variables: in each graph, all but one of the variables appear grouped as a single tuple node.
To change that, I replace the tuple nodes with fully connected subgraphs. This is valid because cliques do not constrain the distribution, as any distribution can be factorized in any order if all variables are kept in the conditionals.
I choose the connection order of the cliques such that they respect a global ordering.
Then I apply Frankenstein, picking, for each node, the graph where that node is isolated.
Is there a way to obtain a narrower bound? Yes: I can pick only DKL(P(Xi|X<i)∥P(Xi)) from the i-th graph.
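To check that this narrower choice is in fact an exact decomposition, here is a small numeric sketch (the three-binary-variable setup and names are mine): the DKL against the product of marginals equals the sum of the terms DKL(P(Xi|X<i) ∥ P(Xi)), with the first term identically zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random joint over three binary variables.
P = rng.random((2, 2, 2))
P /= P.sum()

# Product of marginals: the fully disconnected diagram.
p1 = P.sum(axis=(1, 2))
p2 = P.sum(axis=(0, 2))
p3 = P.sum(axis=(0, 1))
Q = p1[:, None, None] * p2[None, :, None] * p3[None, None, :]

total = np.sum(P * np.log(P / Q))  # DKL(P || product of marginals)

# Per-variable terms E[DKL(P(Xi|X<i) || P(Xi))].
p12 = P.sum(axis=2)
t1 = np.sum(p1 * np.log(p1 / p1))  # 0: nothing precedes X1
t2 = np.sum(p12 * np.log((p12 / p1[:, None]) / p2[None, :]))
t3 = np.sum(P * np.log((P / p12[:, :, None]) / p3[None, None, :]))

assert np.isclose(total, t1 + t2 + t3)
```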
Conditional version: basically the same proof, with the common parent as the first node in the global order. Replacing a tuple node with a clique needs more care to justify: conditional on the parent, the argument above still goes through. The only additional independence property of the graph with the tuple node is that the parent d-separates the tuple from the singleton, and that is preserved as well.
Frankenstein Rule
We’ll prove the approximate version, then the exact version follows trivially.
Without loss of generality, assume the order of variables which satisfies all original diagrams is X1,…,Xn. Let P[X]=∏iP[Xi|Xpaj(i)] be the factorization expressed by diagram j, and let σ(i) be the diagram from which the parents of Xi are taken to form the Frankenstein diagram. (The factorization expressed by the Frankenstein diagram is then P[X]=∏iP[Xi|Xpaσ(i)(i)].)
The proof starts by applying the chain rule to the DKL of the Frankenstein diagram:
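As a numeric sanity check of the resulting bound (the three-binary-variable setup, the two diagrams, and the choice of σ are all mine, not from the post): taking X2's parents from one diagram and X3's from another, the Frankenstein diagram's DKL never exceeds the sum of the original diagrams' DKLs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random joint over three binary variables (indices: x1, x2, x3).
P = rng.random((2, 2, 2))
P /= P.sum()

def kl(p, q):
    """DKL(p || q) for full-support joint arrays."""
    return np.sum(p * np.log(p / q))

# Marginals / conditionals of P.
p1 = P.sum(axis=(1, 2))
p2 = P.sum(axis=(0, 2))
p3 = P.sum(axis=(0, 1))
p12 = P.sum(axis=2)
p23 = P.sum(axis=0)
p2_1 = p12 / p1[:, None]  # P(x2|x1)
p3_2 = p23 / p2[:, None]  # P(x3|x2)

# Diagram 1: X1 -> X2, X3 isolated.  Q1 = P(x1) P(x2|x1) P(x3)
# Diagram 2: X2 -> X3, X1 isolated.  Q2 = P(x1) P(x2) P(x3|x2)
Q1 = p1[:, None, None] * p2_1[:, :, None] * p3[None, None, :]
Q2 = p1[:, None, None] * p2[None, :, None] * p3_2[None, :, :]

# Frankenstein: parents of X2 from diagram 1 (sigma(2)=1), parents of X3
# from diagram 2 (sigma(3)=2), giving X1 -> X2 -> X3.
QF = p1[:, None, None] * p2_1[:, :, None] * p3_2[None, :, :]

# The Frankenstein diagram's error is bounded by the sum of the originals'.
assert kl(P, QF) <= kl(P, Q1) + kl(P, Q2) + 1e-12
```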
I guess you meant the green expression (first margin arrow) to appear in place of the red one (second margin arrow)? The math is right, but I suspect the final line was intended to express both the narrow and the loose bounds at once.
Can’t find the “Swap rule” in the post, what was that?

Oh lol, that was in a draft and I took it out.

I guess here there are missing expectations (indicated in red), otherwise X<i would remain a free RV.

Good catch, thanks.

Exercise answers look good.
I have never encountered this topic, and I couldn’t find it skimming the Handbook of Graphical Models (2019), did you invent it? If not, could you list references?

Aaaand about this?
I wouldn’t be surprised if someone else has done this before, but I did not get it from anywhere else and have never seen anyone else do something quite like it.