A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed

This post walks through the math for a theorem. It’s intended to be a reference post, which we’ll link back to as-needed from future posts. The question which first motivated this theorem for us was: “Redness of a marker seems like maybe a natural latent over a bunch of parts of the marker, and redness of a car seems like maybe a natural latent over a bunch of parts of the car, but what makes redness of the marker ‘the same as’ redness of the car? How are they both instances of one natural thing, i.e. redness? (or ‘color’?)”. But we’re not going to explain in this post how the math might connect to that use-case; this post is just the math.

Suppose we have multiple distributions $P^{1}, \dots, P^{k}$ over the same random variables $X_{1}, \dots, X_{n}$ . (Speaking somewhat more precisely: the distributions are over the same set, and an element of that set is represented by values $(x_{1}, \dots, x_{n})$ .) We take a mixture of the distributions: $P [X] := \sum_{j} α_{j} P^{j} [X]$ , where $\sum_{j} α_{j} = 1$ and $α$ is nonnegative. Then our theorem says: if an approximate natural latent exists over $P [X]$ , and that latent is robustly natural under changing the mixture weights $α$ , then the same latent is approximately natural over $P^{j} [X]$ for all $j$ .
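To make the setup concrete, here is a minimal sketch of the mixture construction in plain Python. The two component distributions over a pair of bits and the helper name `mix` are illustrative assumptions, not part of the theorem.

```python
def mix(components, alpha):
    """Return P[X] = sum_j alpha_j * P^j[X]; each distribution is a dict x -> prob."""
    # The mixture weights must be nonnegative and sum to 1.
    assert all(a >= 0 for a in alpha) and abs(sum(alpha) - 1.0) < 1e-12
    support = set().union(*components)
    return {x: sum(a * p.get(x, 0.0) for a, p in zip(alpha, components))
            for x in support}

# Two hypothetical component distributions over the same variables (X1, X2).
P1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}

P = mix([P1, P2], alpha=[0.25, 0.75])
print(sorted(P.items()))
```

Every choice of `alpha` on the simplex gives a different mixed distribution over the same set of values $(x_{1}, x_{2})$ .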

Mathematically: the natural latent over $P [X]$ is defined by $(x, λ \mapsto P [Λ = λ | X = x])$ , and naturality means that the distribution $(x, λ \mapsto P [Λ = λ | X = x] P [X = x])$ satisfies the naturality conditions (mediation and redundancy). The theorem says that, if the joint distribution $(x, λ \mapsto P [Λ = λ | X = x] \sum_{j} α_{j} P^{j} [X = x])$ satisfies the naturality conditions robustly with respect to changes in $α$ , then $(x, λ \mapsto P [Λ = λ | X = x] P^{j} [X = x])$ satisfies the naturality conditions for all $j$ . “Robustness” here can be interpreted in multiple ways. We’ll cover two here, one for which the theorem is trivial and another more substantive, but we expect there are probably more notions of “robustness” which also make the theorem work.

Trivial Version

First notion of robustness: the joint distribution $(x, λ \mapsto P [Λ = λ | X = x] \sum_{j} α_{j} P^{j} [X = x])$ satisfies the naturality conditions to within $ϵ$ for all values of $α$ (subject to $\sum_{j} α_{j} = 1$ and $α$ nonnegative).

Then: the joint distribution $(x, λ \mapsto P [Λ = λ | X = x] \sum_{j} α_{j} P^{j} [X = x])$ satisfies the naturality conditions to within $ϵ$ specifically for $α_{j} = δ_{j k}$ , i.e. $α$ which is 0 in all entries except a 1 in entry $k$ . In that case, the joint distribution is $(x, λ \mapsto P [Λ = λ | X = x] P^{k} [X = x])$ , so $Λ$ is natural over $P^{k}$ to within $ϵ$ . Apply this for each $k$ , and the theorem is proven.

… but that’s just abusing an overly-strong notion of robustness. Let’s do a more interesting one.

Nontrivial Version

Second notion of robustness: the joint distribution $(x, λ \mapsto P [Λ = λ | X = x] \sum_{j} α_{j} P^{j} [X = x])$ satisfies the naturality conditions to within $ϵ$ , and the gradient of the approximation error with respect to (allowed) changes in $α$ is (locally) zero.

We need to prove that the joint distributions $(x, λ \mapsto P [Λ = λ | X = x] P^{j} [X = x])$ satisfy both the mediation and redundancy conditions for each $j$ . We’ll start with redundancy, because it’s simpler.

Redundancy

We can express the approximation error of the redundancy condition with respect to $X_{i}$ under the mixed distribution as

$D_{K L} (P [Λ, X] | | P [X] P [Λ | X_{i}]) = E_{X} [D_{K L} (P [Λ | X] | | P [Λ | X_{i}])]$

where, recall, $P [Λ, X] := P [Λ | X] \sum_{j} α_{j} P^{j} [X]$ .

We can rewrite that approximation error as:

$E_{X} [D_{K L} (P [Λ | X] | | P [Λ | X_{i}])]$

$= \sum_{X} \sum_{j} α_{j} P^{j} [X] D_{K L} (P [Λ | X] | | P [Λ | X_{i}])$

$= \sum_{j} α_{j} E_{X}^{j} [D_{K L} (P [Λ | X] | | P [Λ | X_{i}])]$

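Note that $P^{j} [Λ | X] = P [Λ | X]$ for every $j$ (by definition, the latent is given by the same conditional distribution under all the distributions), so $P [Λ | X] P^{j} [X] = P^{j} [Λ, X]$ and: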

$= \sum_{j} α_{j} D_{K L} (P^{j} [Λ, X] | | P [Λ | X_{i}] P^{j} [X])$

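and by factorization transfer (the divergence can only shrink when $P [Λ | X_{i}]$ is replaced by $P^{j} [Λ | X_{i}]$ , which is the minimizing conditional under $P^{j}$ ):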

$\geq \sum_{j} α_{j} D_{K L} (P^{j} [Λ, X] | | P^{j} [Λ | X_{i}] P^{j} [X])$

In other words: if $ϵ_{i}^{j}$ is the redundancy error with respect to $X_{i}$ under distribution $j$ , and $ϵ_{i}$ is the redundancy error with respect to $X_{i}$ under the mixed distribution $P$ , then

$ϵ_{i} \geq \sum_{j} α_{j} ϵ_{i}^{j}$

The redundancy error of the mixed distribution is at least the weighted average of the redundancy errors of the individual distributions.

Since the $α_{j} ϵ_{i}^{j}$ terms are nonnegative, that also means, for any $j$ with $α_{j} > 0$ ,

$ϵ_{i}^{j} \leq \frac{1}{α_{j}} ϵ_{i}$

which bounds the approximation error for the $i^{\text{th}}$ redundancy condition under distribution $j$ . Also note that, insofar as the latent is natural across multiple $α$ values, we can use the $α$ value with largest $α_{j}$ to get the best bound for $ϵ_{i}^{j}$ .
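The redundancy inequalities above can be checked numerically. The following is a minimal sketch in plain Python; the component distributions, the noisy latent `LAM_GIVEN_X`, the weights `ALPHA`, and all helper names are hypothetical choices made only for illustration.

```python
from math import log

def mix(components, alpha):
    """Mixture P[X] = sum_j alpha_j P^j[X]; distributions are dicts x -> prob."""
    support = set().union(*components)
    return {x: sum(a * p.get(x, 0.0) for a, p in zip(alpha, components))
            for x in support}

def redundancy_error(px, lam_given_x, i):
    """D_KL(P[Λ,X] || P[X] P[Λ|X_i]) for the joint P[Λ|X] P[X]."""
    marg_i, joint_i = {}, {}  # P[X_i] and P[X_i, Λ]
    for x, p in px.items():
        marg_i[x[i]] = marg_i.get(x[i], 0.0) + p
        for lam, q in lam_given_x[x].items():
            key = (x[i], lam)
            joint_i[key] = joint_i.get(key, 0.0) + p * q
    err = 0.0
    for x, p in px.items():
        for lam, q in lam_given_x[x].items():
            if p * q > 0.0:
                lam_given_xi = joint_i[(x[i], lam)] / marg_i[x[i]]
                err += p * q * log(q / lam_given_xi)
    return err

# Hypothetical components over two bits, and a hypothetical latent P[Λ|X].
P1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
LAM_GIVEN_X = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}
ALPHA = [0.3, 0.7]

def check_redundancy(i):
    """Return (mixture error, list of per-component errors) for X_i."""
    eps_mix = redundancy_error(mix([P1, P2], ALPHA), LAM_GIVEN_X, i)
    eps_parts = [redundancy_error(p, LAM_GIVEN_X, i) for p in (P1, P2)]
    return eps_mix, eps_parts

for i in range(2):
    eps_mix, eps_parts = check_redundancy(i)
    weighted = sum(a * e for a, e in zip(ALPHA, eps_parts))
    print(i, eps_mix, weighted, eps_mix >= weighted - 1e-12)
```

Note that this check uses the same conditional $P [Λ | X]$ for the mixture and for each component, exactly as in the derivation.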

Mediation

Mediation relies more heavily on the robustness of naturality to changes in $α$ . The gradient of the mediation approximation error with respect to $α$ is:

$\frac{\partial}{\partial α_{j}} D_{K L} (P [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ])$

$= \sum_{X, Λ} P [Λ | X] P^{j} [X] \ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}$

(Note: it’s a nontrivial but handy fact that, in general, the change in approximation error of a distribution $P [Y]$ over some DAG, $d D_{K L} (P [Y] | | \prod_{i} P [Y_{i} | Y_{pa(i)}])$ , under a change $d P$ is $\sum_{Y} d P [Y] \ln \frac{P [Y]}{\prod_{i} P [Y_{i} | Y_{pa(i)}]}$ .)
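As a sanity check on the gradient formula, one can compare it against a finite-difference derivative along an allowed direction (increase $α_{1}$ , decrease $α_{2}$ by the same amount). This is a minimal sketch in plain Python; the distributions, the latent, the step size, and all helper names are assumptions chosen for illustration.

```python
from math import log

def mix(components, alpha):
    support = set().union(*components)
    return {x: sum(a * p.get(x, 0.0) for a, p in zip(alpha, components))
            for x in support}

def joint(px, lam_given_x):
    """Joint P[Λ|X] P[X] as a dict (x, lam) -> prob."""
    return {(x, lam): p * q
            for x, p in px.items() for lam, q in lam_given_x[x].items()}

def log_ratio(pj):
    """ln(P[Λ,X] / (P[Λ] prod_i P[X_i|Λ])) at every point of the joint pj."""
    n = len(next(iter(pj))[0])
    p_lam = {}
    p_xi_lam = [{} for _ in range(n)]
    for (x, lam), p in pj.items():
        p_lam[lam] = p_lam.get(lam, 0.0) + p
        for i in range(n):
            key = (x[i], lam)
            p_xi_lam[i][key] = p_xi_lam[i].get(key, 0.0) + p
    ratio = {}
    for (x, lam), p in pj.items():
        fact = p_lam[lam]
        for i in range(n):
            fact *= p_xi_lam[i][(x[i], lam)] / p_lam[lam]
        ratio[(x, lam)] = log(p / fact)
    return ratio

def mediation_error(pj):
    ratio = log_ratio(pj)
    return sum(p * ratio[k] for k, p in pj.items())

def gradient(components, alpha, lam_given_x):
    """Formula from the text: sum_{X,Λ} P[Λ|X] P^j[X] ln(ratio), one entry per j."""
    ratio = log_ratio(joint(mix(components, alpha), lam_given_x))
    return [sum(p * ratio[k] for k, p in joint(pc, lam_given_x).items())
            for pc in components]

# Hypothetical full-support components and latent (so every log is defined).
P1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
LAM_GIVEN_X = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}

def error_at(a1):
    return mediation_error(joint(mix([P1, P2], [a1, 1.0 - a1]), LAM_GIVEN_X))

a1, h = 0.3, 1e-5
finite_diff = (error_at(a1 + h) - error_at(a1 - h)) / (2 * h)
g = gradient([P1, P2], [a1, 1.0 - a1], LAM_GIVEN_X)
print(finite_diff, g[0] - g[1])  # the two numbers should agree closely
```

The directional derivative along this allowed change is the difference of the two gradient components, which is why the comparison is against `g[0] - g[1]`.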

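Robustness requires this gradient to vanish along allowed changes in $α$ , i.e. changes $d α$ which preserve $\sum_{j} α_{j} = 1$ and so satisfy $\sum_{j} d α_{j} = 0$ . The partial derivatives above must therefore be orthogonal to every such $d α$ , which means they take the same value for every index $j$ :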

$\text{constant} = \sum_{X, Λ} P [Λ | X] P^{j} [X] \ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}$

To find that constant, we can take a sum weighted by $α_{j}$ on both sides:

$\text{constant} = \sum_{j} α_{j} \sum_{X, Λ} P [Λ | X] P^{j} [X] \ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}$

$= D_{K L} (P [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ])$
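The last step, that the $α$ -weighted sum of the gradient components equals the mediation error of the mixture, holds at any $α$ and is easy to check numerically. Below is a minimal sketch in plain Python with hypothetical distributions and helper names; it only checks the weighted-sum identity, and does not search for an $α$ at which the gradient components are all equal.

```python
from math import log

def mix(components, alpha):
    support = set().union(*components)
    return {x: sum(a * p.get(x, 0.0) for a, p in zip(alpha, components))
            for x in support}

def joint(px, lam_given_x):
    """Joint P[Λ|X] P[X] as a dict (x, lam) -> prob."""
    return {(x, lam): p * q
            for x, p in px.items() for lam, q in lam_given_x[x].items()}

def log_ratio(pj):
    """ln(P[Λ,X] / (P[Λ] prod_i P[X_i|Λ])) at every point of the joint pj."""
    n = len(next(iter(pj))[0])
    p_lam = {}
    p_xi_lam = [{} for _ in range(n)]
    for (x, lam), p in pj.items():
        p_lam[lam] = p_lam.get(lam, 0.0) + p
        for i in range(n):
            key = (x[i], lam)
            p_xi_lam[i][key] = p_xi_lam[i].get(key, 0.0) + p
    ratio = {}
    for (x, lam), p in pj.items():
        fact = p_lam[lam]
        for i in range(n):
            fact *= p_xi_lam[i][(x[i], lam)] / p_lam[lam]
        ratio[(x, lam)] = log(p / fact)
    return ratio

# Hypothetical full-support components, latent, and mixture weights.
P1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
LAM_GIVEN_X = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}
ALPHA = [0.3, 0.7]

mixed_joint = joint(mix([P1, P2], ALPHA), LAM_GIVEN_X)
ratio = log_ratio(mixed_joint)

# Mediation error of the mixture: D_KL(P[Λ,X] || P[Λ] prod_i P[X_i|Λ]).
mixture_error = sum(p * ratio[k] for k, p in mixed_joint.items())

# Gradient components: expectation of the same log ratio under each P^j[Λ,X].
grad = [sum(p * ratio[k] for k, p in joint(pc, LAM_GIVEN_X).items())
        for pc in (P1, P2)]
weighted_sum = sum(a * g for a, g in zip(ALPHA, grad))

print(mixture_error, weighted_sum)  # these should agree up to rounding
```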

So, robustness tells us that the approximation error under the mixed distribution can be written as

$D_{K L} (P [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ]) = \text{constant} = \sum_{X, Λ} P [Λ | X] P^{j} [X] \ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}$

for any $j$ .

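Next, we’ll relate the right-hand side to the component distribution $P^{j} [Λ, X] := P [Λ | X] P^{j} [X]$ . Let $E^{j}$ denote expectation under $P^{j} [Λ, X]$ , so that the right-hand side is $E^{j} [\ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}]$ . Splitting the logarithm:

$E^{j} [\ln \frac{P [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}]$

$= E^{j} [\ln \frac{P^{j} [Λ, X]}{P [Λ] \prod_{i} P [X_{i} | Λ]}] - E^{j} [\ln \frac{P^{j} [Λ, X]}{P [Λ, X]}]$

$= D_{K L} (P^{j} [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ]) - D_{K L} (P^{j} [Λ, X] | | P [Λ, X])$

Writing out $P [Λ, X]$ as a mixture weighted by $α$ , we have $P [Λ, X] = \sum_{k} α_{k} P^{k} [Λ, X] \geq α_{j} P^{j} [Λ, X]$ pointwise, so $D_{K L} (P^{j} [Λ, X] | | P [Λ, X]) \leq \ln \frac{1}{α_{j}}$ . Combined with the robustness equation, that gives:

$D_{K L} (P^{j} [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ]) \leq D_{K L} (P [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ]) + \ln \frac{1}{α_{j}}$

Then factorization transfer gives:

$D_{K L} (P^{j} [Λ, X] | | P^{j} [Λ] \prod_{i} P^{j} [X_{i} | Λ]) \leq D_{K L} (P^{j} [Λ, X] | | P [Λ] \prod_{i} P [X_{i} | Λ])$

So, if $ϵ^{j}$ is the mediation error under distribution $j$ (note that we’re overloading notation, $ϵ$ is no longer the redundancy error, and there is a single mediation condition rather than one per $X_{i}$ ), and $ϵ$ is the mediation error under the mixed distribution $P$ , then the above says

$ϵ^{j} \leq ϵ + \ln \frac{1}{α_{j}}$

which bounds the approximation error for the mediation condition under distribution $j$ . Unlike the redundancy bound, the extra term is additive and does not shrink with $ϵ$ : it is small only when $α_{j}$ is close to 1. The term cannot be dropped in general. For example, let $X_{1}, X_{2}$ be bits, $P^{1}$ uniform over $X_{1} = X_{2}$ , $P^{2}$ uniform over $X_{1} \neq X_{2}$ , $α = (\frac{1}{2}, \frac{1}{2})$ , and $Λ$ constant: the mixture has zero mediation error and zero gradient, while each component has mediation error $\ln 2 = \ln \frac{1}{α_{j}}$ .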