A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
This post walks through the math for a theorem. It’s intended to be a reference post, which we’ll link back to as-needed from future posts. The question which first motivated this theorem for us was: “Redness of a marker seems like maybe a natural latent over a bunch of parts of the marker, and redness of a car seems like maybe a natural latent over a bunch of parts of the car, but what makes redness of the marker ‘the same as’ redness of the car? How are they both instances of one natural thing, i.e. redness? (or ‘color’?)”. But we’re not going to explain in this post how the math might connect to that use-case; this post is just the math.
Suppose we have multiple distributions $P^1(X), \dots, P^K(X)$ over the same random variables $X_1, \dots, X_n$. (Speaking somewhat more precisely: the distributions are over the same set, and an element of that set is represented by values $(x_1, \dots, x_n)$.) We take a mixture of the distributions: $P[X] := \sum_k \alpha_k P^k[X]$, where $\sum_k \alpha_k = 1$ and $\alpha$ is nonnegative. Then our theorem says: if an approximate natural latent exists over $P[X]$, and that latent is robustly natural under changing the mixture weights $\alpha$, then the same latent is approximately natural over $P^k[X]$ for all $k$.
Mathematically: the natural latent over $P[X]$ is defined by $(x,\lambda) \mapsto P[X=x]\,P[\Lambda=\lambda|X=x]$, and naturality means that this distribution satisfies the naturality conditions (mediation and redundancy). The theorem says that, if the joint distribution $(x,\lambda) \mapsto P[X=x]\,P[\Lambda=\lambda|X=x]$ satisfies the naturality conditions robustly with respect to changes in $\alpha$, then $(x,\lambda) \mapsto P^k[X=x]\,P[\Lambda=\lambda|X=x]$ satisfies the naturality conditions for all $k$. “Robustness” here can be interpreted in multiple ways—we’ll cover two here, one for which the theorem is trivial and another more substantive, but we expect there are probably more notions of “robustness” which also make the theorem work.
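To make the objects concrete before diving into the proof, here is a minimal numerical sketch (a made-up discrete example added for illustration; none of the names or numbers come from the proof): component distributions $P^k[X]$, mixture weights $\alpha$, a shared conditional $P[\Lambda|X]$ defining the latent, and the resulting joint distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    """A random strictly-positive distribution with the given shape."""
    p = rng.random(shape) + 1e-3
    return p / p.sum()

K = 2                                             # number of component distributions P^k
P_k = [random_dist((2, 2)) for _ in range(K)]     # P^k[X] for X = (X_1, X_2), each binary
alpha = np.array([0.7, 0.3])                      # mixture weights: nonnegative, sum to 1

# The latent is defined by a conditional P[Lambda | X], shared across all components.
lam_given_x = rng.random((2, 2, 2)) + 1e-3
lam_given_x /= lam_given_x.sum(axis=-1, keepdims=True)    # normalize over Lambda

# Mixed distribution P[X] = sum_k alpha_k P^k[X], and the joints over (X, Lambda).
P_mix = sum(a * P for a, P in zip(alpha, P_k))
joint_mix = P_mix[..., None] * lam_given_x                # P[X=x] P[Lambda=lam | X=x]
joint_k = [P[..., None] * lam_given_x for P in P_k]       # P^k[X=x] P[Lambda=lam | X=x]

print("mixed joint normalizes:", np.isclose(joint_mix.sum(), 1.0))
```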
Trivial Version
First notion of robustness: the joint distribution $(x,\lambda) \mapsto P[X=x]\,P[\Lambda=\lambda|X=x]$ satisfies the naturality conditions to within $\epsilon$ for all values of $\alpha$ (subject to $\sum_k \alpha_k = 1$ and $\alpha$ nonnegative).
Then: the joint distribution satisfies the naturality conditions to within $\epsilon$ specifically for $\alpha_j = \delta_{jk}$, i.e. the $\alpha$ which is 0 in all entries except a 1 in entry $k$. In that case, the joint distribution is $(x,\lambda) \mapsto P^k[X=x]\,P[\Lambda=\lambda|X=x]$, therefore $\Lambda$ is natural over $P^k$. Invoke this for each $k$, and the theorem is proven.
… but that’s just abusing an overly-strong notion of robustness. Let’s do a more interesting one.
Nontrivial Version
Second notion of robustness: the joint distribution $(x,\lambda) \mapsto P[X=x]\,P[\Lambda=\lambda|X=x]$ satisfies the naturality conditions to within $\epsilon$, and the gradient of the approximation error with respect to (allowed) changes in $\alpha$ is (locally) zero.
We need to prove that the joint distributions $(x,\lambda) \mapsto P^k[X=x]\,P[\Lambda=\lambda|X=x]$ satisfy both the mediation and redundancy conditions for each $k$. We’ll start with redundancy, because it’s simpler.
Redundancy
We can express the approximation error of the redundancy condition with respect to $X_i$ under the mixed distribution $P$ as

$$\epsilon_i := D_{KL}\left(P[X,\Lambda]\;\middle\|\;P[X]\,P[\Lambda|X_{\bar i}]\right) = \sum_{x,\lambda} P[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[\Lambda=\lambda|X=x]}{P[\Lambda=\lambda|X_{\bar i}=x_{\bar i}]}$$

where $X_{\bar i}$ denotes all components of $X$ except $X_i$, and, recall, $P[X] = \sum_k \alpha_k P^k[X]$.
We can rewrite that approximation error as:

$$\epsilon_i = \sum_k \alpha_k \sum_{x,\lambda} P^k[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[\Lambda=\lambda|X=x]}{P[\Lambda=\lambda|X_{\bar i}=x_{\bar i}]}$$
Note that $P[\Lambda|X]$ is the same under all the distributions (by definition), so:

$$\epsilon_i = \sum_k \alpha_k \sum_{x,\lambda} P^k[X=x]\,P^k[\Lambda=\lambda|X=x]\,\log\frac{P^k[\Lambda=\lambda|X=x]}{P[\Lambda=\lambda|X_{\bar i}=x_{\bar i}]}$$
and by factorization transfer (for each $k$, replacing the mixed-distribution conditional $P[\Lambda|X_{\bar i}]$ with $P^k$’s own conditional $P^k[\Lambda|X_{\bar i}]$ cannot increase the divergence):

$$\epsilon_i \ge \sum_k \alpha_k \sum_{x,\lambda} P^k[X=x]\,P^k[\Lambda=\lambda|X=x]\,\log\frac{P^k[\Lambda=\lambda|X=x]}{P^k[\Lambda=\lambda|X_{\bar i}=x_{\bar i}]}$$
In other words: if $\epsilon_i^k$ is the redundancy error with respect to $X_i$ under distribution $P^k$, and $\epsilon_i$ is the redundancy error with respect to $X_i$ under the mixed distribution $P$, then

$$\sum_k \alpha_k \epsilon_i^k \le \epsilon_i$$
The redundancy error of the mixed distribution is at least the weighted average of the redundancy errors of the individual distributions.
Since the terms are nonnegative, that also means

$$\epsilon_i^k \le \frac{1}{\alpha_k}\,\epsilon_i$$
which bounds the approximation error for the redundancy condition under distribution $P^k$. Also note that, insofar as the latent is natural across multiple $\alpha$ values, we can use the $\alpha$ value with largest $\alpha_k$ to get the best bound for $\epsilon_i^k$.
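This bound is easy to check numerically. Below is a minimal self-contained sketch (a made-up two-variable, three-component example; the names and numbers are purely illustrative) which computes $\epsilon_i$ under a random mixture and $\epsilon_i^k$ under each component, and confirms both $\sum_k \alpha_k \epsilon_i^k \le \epsilon_i$ and $\epsilon_i^k \le \epsilon_i/\alpha_k$. Note that this particular bound did not use robustness at all, so it holds for arbitrary components, weights, and latent conditionals.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_dist(shape):
    p = rng.random(shape) + 1e-3
    return p / p.sum()

def redundancy_error(P_x, lam_given_x, i):
    """eps_i = D_KL( P[X, Lambda] || P[X] P[Lambda | X_{-i}] ) for X = (X_1, X_2)."""
    joint = P_x[..., None] * lam_given_x              # axes: (x1, x2, lambda)
    # Marginalize out X_i to get P[X_{-i}, Lambda], then condition on X_{-i}.
    marg = joint.sum(axis=i, keepdims=True)
    lam_given_rest = marg / marg.sum(axis=-1, keepdims=True)
    # Expected KL between P[Lambda | X] and P[Lambda | X_{-i}] under P[X].
    return float((joint * np.log(lam_given_x / lam_given_rest)).sum())

K = 3
alpha = np.array([0.5, 0.3, 0.2])
P_k = [random_dist((2, 2)) for _ in range(K)]
lam_given_x = rng.random((2, 2, 2)) + 1e-3
lam_given_x /= lam_given_x.sum(axis=-1, keepdims=True)
P_mix = sum(a * P for a, P in zip(alpha, P_k))

for i in range(2):
    eps_mix = redundancy_error(P_mix, lam_given_x, i)
    eps_comp = [redundancy_error(P, lam_given_x, i) for P in P_k]
    assert np.dot(alpha, eps_comp) <= eps_mix + 1e-12
    assert all(e <= eps_mix / a + 1e-12 for e, a in zip(eps_comp, alpha))
    print(f"i={i}: eps_mix={eps_mix:.4f}, weighted avg of eps_i^k={np.dot(alpha, eps_comp):.4f}")
```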
Mediation
Mediation relies more heavily on the robustness of naturality to changes in $\alpha$. The gradient of the mediation approximation error $\epsilon := D_{KL}\left(P[X,\Lambda]\;\middle\|\;P[\Lambda]\prod_i P[X_i|\Lambda]\right)$ with respect to $\alpha$ is:

$$\frac{\partial \epsilon}{\partial \alpha_k} = \sum_{x,\lambda} P^k[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[X=x,\Lambda=\lambda]}{P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]}$$
(Note: it’s a nontrivial but handy fact that, in general, the change in approximation error $\epsilon$ of a distribution $P$ over some DAG under a change $dP$ is $d\epsilon = \sum_x dP[X=x]\,\log\frac{P[X=x]}{\prod_i P[X_i=x_i|X_{pa(i)}=x_{pa(i)}]}$.)
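To spell out where that fact comes from (a short derivation added here for completeness, in the same notation): the approximation error over the DAG is

$$\epsilon = \sum_x P[X=x]\,\log\frac{P[X=x]}{\prod_i P[X_i=x_i\mid X_{pa(i)}=x_{pa(i)}]}$$

and differentiating term-by-term,

$$d\epsilon = \sum_x dP[X=x]\,\log\frac{P[X=x]}{\prod_i P[X_i=x_i\mid X_{pa(i)}=x_{pa(i)}]} \;+\; \sum_x dP[X=x] \;-\; \sum_i \sum_{x_{pa(i)}} P[X_{pa(i)}=x_{pa(i)}]\sum_{x_i} dP[X_i=x_i\mid X_{pa(i)}=x_{pa(i)}]$$

The second term is zero for perturbations which keep $P$ normalized (as changes in $\alpha$ with $\sum_k d\alpha_k = 0$ do), and each inner sum in the third term is the change in $\sum_{x_i} P[X_i=x_i\mid X_{pa(i)}=x_{pa(i)}] = 1$, which is identically zero; so only the first term survives.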
Note that this gradient must be zero along allowed changes in $\alpha$, which means the changes must respect $\sum_k d\alpha_k = 0$. That means the gradient must be constant across indices $k$:

$$\sum_{x,\lambda} P^k[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[X=x,\Lambda=\lambda]}{P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]} = \text{const, independent of } k$$
To find that constant, we can take a sum weighted by $\alpha_k$ on both sides:

$$\sum_k \alpha_k \sum_{x,\lambda} P^k[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[X=x,\Lambda=\lambda]}{P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]} = \sum_{x,\lambda} P[X=x,\Lambda=\lambda]\,\log\frac{P[X=x,\Lambda=\lambda]}{P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]} = \epsilon$$
So, robustness tells us that the approximation error $\epsilon$ under the mixed distribution can be written as

$$\epsilon = \sum_{x,\lambda} P^k[X=x]\,P[\Lambda=\lambda|X=x]\,\log\frac{P[X=x,\Lambda=\lambda]}{P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]}$$

for any $k$.
Next, we write out $P[X=x,\Lambda=\lambda]$ as a mixture weighted by $\alpha$ and use Jensen’s inequality on that mixture and the logarithm; factorization transfer then lets us replace the mixed distribution’s factorization $P[\Lambda=\lambda]\prod_i P[X_i=x_i|\Lambda=\lambda]$ with each component distribution’s own factorization.
Much like redundancy, if $\epsilon^k$ is the mediation error under distribution $P^k$ (note that we’re overloading notation, $\epsilon^k$ is no longer the redundancy error), and $\epsilon$ is the mediation error under the mixed distribution $P$, then the above says

$$\sum_k \alpha_k \epsilon^k \le \epsilon$$
Since the terms are nonnegative, that also means

$$\epsilon^k \le \frac{1}{\alpha_k}\,\epsilon$$
which bounds the approximation error for the mediation condition under distribution $P^k$.
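The gradient formula at the start of this section is also easy to check numerically. Below is a self-contained sketch (again a made-up two-variable example, purely illustrative) comparing a finite-difference estimate of the directional derivative of the mediation error, along an allowed direction $v$ with $\sum_k v_k = 0$, against the analytic expression $\sum_k v_k\,\partial\epsilon/\partial\alpha_k$ given above.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_dist(shape):
    p = rng.random(shape) + 1e-3
    return p / p.sum()

def mediation_error(P_x, lam_given_x):
    """D_KL( P[X, Lambda] || P[Lambda] prod_i P[X_i | Lambda] ) for X = (X_1, X_2)."""
    joint = P_x[..., None] * lam_given_x            # axes: (x1, x2, lambda)
    p_lam = joint.sum(axis=(0, 1))                  # P[Lambda]
    x1_given_lam = joint.sum(axis=1) / p_lam        # P[X_1 | Lambda]
    x2_given_lam = joint.sum(axis=0) / p_lam        # P[X_2 | Lambda]
    factored = p_lam[None, None, :] * x1_given_lam[:, None, :] * x2_given_lam[None, :, :]
    return float((joint * np.log(joint / factored)).sum())

K = 3
alpha = np.array([0.5, 0.3, 0.2])
P_k = [random_dist((2, 2)) for _ in range(K)]
lam_given_x = rng.random((2, 2, 2)) + 1e-3
lam_given_x /= lam_given_x.sum(axis=-1, keepdims=True)

def mix(a):
    return sum(ai * Pi for ai, Pi in zip(a, P_k))

# Analytic directional derivative along an allowed direction v (entries sum to 0).
v = np.array([1.0, -1.0, 0.0])
joint_mix = mix(alpha)[..., None] * lam_given_x
p_lam = joint_mix.sum(axis=(0, 1))
x1_given_lam = joint_mix.sum(axis=1) / p_lam
x2_given_lam = joint_mix.sum(axis=0) / p_lam
factored = p_lam[None, None, :] * x1_given_lam[:, None, :] * x2_given_lam[None, :, :]
log_ratio = np.log(joint_mix / factored)
grad_k = np.array([((Pi[..., None] * lam_given_x) * log_ratio).sum() for Pi in P_k])
analytic = float(v @ grad_k)

# Finite-difference estimate of the same directional derivative.
t = 1e-5
numeric = (mediation_error(mix(alpha + t * v), lam_given_x)
           - mediation_error(mix(alpha - t * v), lam_given_x)) / (2 * t)

print(f"analytic: {analytic:.6f}   finite-difference: {numeric:.6f}")
assert abs(analytic - numeric) < 1e-4
```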
Alright, I’m terrible at abstract thinking, so I went through the post and came up with a concrete example. Does this seem about right?
We are a quantitative trading firm. Our investment strategy is such that we care about the prices of the stocks in the S&P 500 at market close today (X1,…,Xn).
We have a bunch of models of the stock market (P1,…,Pk), where we can feed in a set of possible prices of stocks in the S&P 500 at market close, and the model spits out a probability of seeing that exact combination of prices (where a single combination of prices is (x1,…,xn)).
We believe that some of our models are better than others, so our trading strategy is to take a weighted average of the predictions of each model, where the weight assigned to the jth model Pj is αj, and obviously the weights have to sum to 1 for this to be an “average”.
We believe that there is some underlying factor which we will call “market factors” (Λ) such that if you control for “market factors”, you no longer learn (approximately) anything about the price of, say, MSFT when you learn about the price of AAPL. And also such that if you order the stocks in the S&P 500 alphabetically, take the odd-indexed stocks in that list (i.e. A, AAPL, ABNB, …) and call them the S&P250odd, and call the even-indexed ones (i.e. AAL, ABBV, ABT, …) the S&P250even, you will come to (approximately) the same estimate of “market factors” by looking at either the S&P250odd or the S&P250even. Further, this means that if you estimate “market factors” by looking at S&P250odd, then your estimate of the price of AAL will be approximately unchanged if you learn the price of ABT.
Anyway, if we find that the above holds for the weighted sum we use in practice, and we also find that it robustly [1] holds when we change the weights, that actually means that all of our market price models take “market factors” into account.
Alternatively stated, it means that if one of the models was written by an intern that procrastinated until the end of his internship and then on the last morning wrote
```python
import numpy
def predict_price(ticker): return numpy.random.lognormal()
```
then our weighted sum is not robust to changes in the weights. Is this a reasonable interpretation? If so, I’m pretty interested to see where you go with this.
[1] Terms and conditions apply. This information is not intended as, and shall not be understood or construed as, financial advice.
Nailed it, well done.
One point of confusion I still have: relative to whose prediction capabilities does a natural latent screen off information?
Let’s say one of the models in the ensemble, “YTDA”, knows the beginning-of-year price of each stock and uses “average year-to-date market appreciation” as its latent. Then learning the average year-to-date market appreciation of the S&P250odd will tell it approximately everything about that latent, and learning the year-to-date appreciation of ABT will give it almost no information it knows how to use about the year-to-date appreciation of AMGN.
So relative to the predictive capabilities of the YTDA model, I think it is true that “average year-to-date market appreciation” is a natural latent.
However, another model in the ensemble, “YTDAPS”, might use “per-sector average year-to-date market appreciation” as its latent. Since both the S&P250even and S&P250odd contain plenty of stocks in each sector, it is again the case that once you know the YTDAPS latent (conditioning on the S&P250odd), learning the price of ABT will not tell the YTDAPS model anything about the price of AMGN.
But then if both of these are latents, does that mean that your theorem proves that any weighted sum of natural latents is also itself a natural latent?
Let’s see if I get this right...
Let’s interpret the set X as the set of all possible visual sensory experiences x=(x1,…,xn), where xi defines the color of the ith pixel.
Different distributions over elements of this set correspond to observing different objects; for example, we can have Pcar(X) and Papple(X), corresponding to us predicting different sensory experiences when looking at cars vs. apples.
Let’s take some specific set of observations XO⊂X, from which we’d be trying to derive a latent.
We assume uncertainty regarding what objects generated the training-set observations, getting a mixture of distributions Qα(XO)=αPcar(XO)+(1−α)Papple(XO).
We derive a natural latent Λ for Qα(XO) such that Qα(XO|Λ)=Πx∈XOQα(x|Λ) for all allowed α.
This necessarily implies that Λ also induces independence between different sensory experiences for each individual distribution in the mixture: Pcar(XO|Λ)=Πx∈XOPcar(x|Λ) and Papple(XO|Λ)=Πx∈XOPapple(x|Λ).
If the set XO contains some observations generated by cars and some observations generated by apples, yet a nontrivial latent over the entire set nonetheless exists, then this latent must summarize information about some feature shared by both objects.
For example, perhaps it transpired that all cars depicted in this dataset are red, and all apples in this dataset are red, so Λ=Λredness ends up as “the concept of redness”.
This latent then could, prospectively, be applied to new objects. If we later learn of the existence of P_ink(X) – ink being an object the sight of which predicts yet another distribution over visual experiences – then Λredness would “know” how to handle this “out of the box”. For example, if we have a set of observations XO′ such that it contains some red cars and some red ink, then Λredness would be natural over this set under both distributions, without us needing to recompute it.
This trick could be applied for learning new “features” of objects. Suppose we have some established observation-sets Xcars and Xapples, which have nontrivial natural latents Λcar and Λapple. To find new “object-agnostic” latents, we can try to form new sets of observations from subsets of those observations, define corresponding distributions, and see if mixtures of distributions over those subsets have nontrivial latents.
Formally: Xtest=Xspecific-cars∪Xspecific-apples where Xspecific-cars⊂Xcars and Xspecific-apples⊂Xapples, then Hα(Xtest)=αPcar(Xtest)+(1−α)Papple(Xtest), and we want to see if we have a new Λ that induces (approximate) independence between all x∈Xtest both under the “apple” and the “car” distributions.
Though note that it could be done the other way around as well: we could first learn the latents of “redness” and e. g. “greenness” by grouping all red-having and green-having observations, then try to find some subsets of those sets which also have nontrivial natural latents, and end up deriving the latent of “car” by grouping all red and green objects that happen to be cars.
(Which is to say, I’m not necessarily sure there’s a sharp divide between “adjectives” and “nouns” in this formulation. “The property of car-ness” is interpretable as an adjective here, and “greenery” is interpretable as a noun.)
I’d also expect that the latent over Xred-cars, i. e. Λred-car, could be constructed out of Λcar and Λredness (derived, respectively, from a pure-cars dataset and an all-red dataset)? In other words, if we simultaneously condition a dataset of red cars on a latent derived from a dataset of any-colored cars and a latent derived from a dataset of red-colored objects, then this combined latent Λredness⋅Λcar would induce independence across Xred-cars (which Λcar wouldn’t be able to do on its own, due to the instances sharing color-related information in addition to car-ness)?
All of this is interesting mostly in the approximate-latent regime (this allows us to avoid the nonrobust-to-tiny-mixtures trap), and in situations in which we already have some established latents which we want to break down into interoperable features.
In principle, if we have e. g. two sets of observations that we already know correspond to nontrivial latents, e. g. Xcars and Xapples, we could directly try to find subsets of their union that correspond to new nontrivial latents, in the hopes of recovering some features that’d correspond to grouping observations along some other dimension.
But if we already have established “object-typed” probability distributions Pcar(X) and Papple(X), then hypothesizing that the observations are generated by an arbitrary mixture of these distributions allows us to “wash out” any information that doesn’t actually correspond to some robustly shared features of cars-or-apples.
That is: consider if Xtest is 99% cars, 1% apples. Then an approximately correct natural latent over it is basically just Λcar, maybe with some additional noise from apples thrown in. This is what we’d get if we used the “naive” procedure in (1) above. But if we’re allowed to mix up the distributions, then “ramping” up the “apple” distribution (defining Qα=0.01(X), say) would end up with low probabilities assigned to all observations corresponding to cars, and now the approximately correct natural latent over this dataset would have more apple-like qualities. The demand for the latent to be valid on arbitrary α∈[0,1] then “washes out” all traces of car-ness and apple-ness, leaving only redness.
Is this about right? I’m getting a vague sense of some disconnect between this formulation and the OP...