Natural Latents Are Not Robust To Tiny Mixtures
In our previous natural latent posts, our core theorem typically says something like:
Assume two agents have the same predictive distribution $P[X]$ over variables $X$, but model that distribution using potentially-different latent variables. If the latents both satisfy some simple “naturality” conditions (mediation and redundancy), then the two agents’ latents contain approximately the same information about $X$. So, insofar as the two agents both use natural latents internally, we have reason to expect that the internal latents of one can be faithfully translated into the internal latents of the other.
This post is about one potential weakness in that claim: what happens when the two agents’ predictive distributions are only approximately the same?
Following the pattern of our previous theorems, we’d ideally say something like:
If the two agents’ distributions are within $\epsilon$ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about $X$, to within some $O(\epsilon)$ bound.
But that turns out to be false.
The Tiny Mixtures Counterexample
Let’s start with two distributions, $P_0$ and $Q_0$, over $X$. These won’t be our two agents’ distributions; we’re going to construct our two agents’ distributions by mixing these two together, as the name “tiny mixtures” suggests.
$P_0$ and $Q_0$ will have extremely different natural latents. Specifically:
$X_1$ consists of 1 million bits, and $X_2$ consists of another 1 million bits.
Under $P_0$, $X_1$ is uniform and $X_2 = X_1$. So, there is an exact natural latent $\Lambda^P = X_1 = X_2$ under $P_0$.
Under $Q_0$, $X_1$ and $X_2$ are independent and uniform. So, the empty latent $\Lambda^Q$ is exactly natural under $Q_0$.
Mental picture: we have a million-bit channel; under $P_0$ the output ($X_2$) is equal to the input ($X_1$), while under $Q_0$ the channel hardware is maintained by Comcast, so the two are independent.
Now for our two agents’ distributions, $P$ and $Q$. $P$ will be almost $P_0$, and $Q$ will be almost $Q_0$, but each agent puts a $\frac{1}{2^{50}}$ probability on the other distribution:
$$P = \left(1 - \tfrac{1}{2^{50}}\right) P_0 + \tfrac{1}{2^{50}} Q_0$$
$$Q = \tfrac{1}{2^{50}} P_0 + \left(1 - \tfrac{1}{2^{50}}\right) Q_0$$
First key observation: $D_{KL}(P||Q)$ and $D_{KL}(Q||P)$ are both roughly 50 bits. Calculation:
$$D_{KL}(P||Q) = \sum_{X_1, X_2} P[X](\log P[X] - \log Q[X]) \approx \sum_{X_1 = X_2} \frac{1}{2^{1000000}}\left(-1000000 - \log\left(\frac{1}{2^{2000000}} + \frac{1}{2^{50}}\cdot\frac{1}{2^{1000000}}\right)\right) \approx 50$$
$$D_{KL}(Q||P) = \sum_{X_1, X_2} Q[X](\log Q[X] - \log P[X]) \approx \sum_{X_1 \neq X_2} \frac{1}{2^{2000000}}\left(-2000000 - \log\left(\frac{1}{2^{50}}\cdot\frac{1}{2^{2000000}}\right)\right) \approx 50$$
Intuitively: since each distribution puts roughly $\frac{1}{2^{50}}$ on the other, it takes about 50 bits of evidence to update from either one to the other.
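As a sanity check, here is a minimal Python sketch (not from the original post) which computes both KL divergences exactly for a scaled-down version of the construction, exploiting the fact that there are only two kinds of outcomes: $X_1 = X_2$ and $X_1 \neq X_2$. The parameter values $N = 40$ and $M = 20$ are arbitrary choices small enough for floating point.

```python
# Sketch (not from the post): check the ~M-bit KL divergences on a scaled-down
# tiny mixture. N and M are assumption parameters standing in for the post's
# N = 1,000,000 and M = 50.
import math

def tiny_mixture_kls(N, M):
    """P0: X1 uniform on N bits, X2 = X1.  Q0: X1, X2 independent uniform.
    P = (1 - 2^-M) P0 + 2^-M Q0,  Q = 2^-M P0 + (1 - 2^-M) Q0.
    Only two kinds of outcomes: diagonal (x1 == x2) and off-diagonal."""
    eps = 2.0 ** -M
    n_diag = 2.0 ** N                      # number of x1 == x2 outcomes
    n_off = 2.0 ** (2 * N) - n_diag        # number of x1 != x2 outcomes
    p_diag = (1 - eps) * 2.0 ** -N + eps * 2.0 ** (-2 * N)
    p_off = eps * 2.0 ** (-2 * N)
    q_diag = eps * 2.0 ** -N + (1 - eps) * 2.0 ** (-2 * N)
    q_off = (1 - eps) * 2.0 ** (-2 * N)

    def kl(a_diag, a_off, b_diag, b_off):
        return (n_diag * a_diag * math.log2(a_diag / b_diag)
                + n_off * a_off * math.log2(a_off / b_off))

    return kl(p_diag, p_off, q_diag, q_off), kl(q_diag, q_off, p_diag, p_off)

# With N = 40 bits and a 2^-20 mixin, both KLs come out close to 20 bits.
print(tiny_mixture_kls(N=40, M=20))
```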
Second key observation: the empty latent is approximately natural under $Q$, and the latent $\Lambda := X_1$ is approximately natural under $P$. Epsilons:
Under $Q$, the empty latent satisfies mediation to within about $\frac{1}{2^{50}} \cdot 1000000 \approx \frac{1}{2^{30}}$ bits (this is just the mutual information of $X_1$ and $X_2$ under $Q$), and redundancy exactly (since the empty latent can always be exactly computed from any input).
Under $P$, $\Lambda := X_1$ satisfies mediation exactly (since $X_1$ mediates between $X_1$ and anything else), redundancy with respect to $X_2$ exactly ($\Lambda = X_1$ can be exactly computed from just $X_1$, without $X_2$), and redundancy with respect to $X_1$ to within about $\frac{1}{2^{50}} \cdot 1000000 \approx \frac{1}{2^{30}}$ bits (since there’s a $\frac{1}{2^{50}}$ chance that $X_2$ doesn’t tell us the relevant 1000000 bits).
… and of course the information those two latents give us about $X$ differs by 1 million bits: one of them is empty, and the other directly tells us 1 million bits about $X_1$.
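Continuing the sanity check, this sketch (again not from the post; the helper names and the scaled-down parameters are my own) computes the two naturality errors exactly: the mutual information $I(X_1; X_2)$ under $Q$ and the conditional entropy $H(X_1 | X_2)$ under $P$. Both land near $\frac{N}{2^M}$ bits, matching the estimates above.

```python
# Sketch (not from the post): the naturality "epsilons" on a scaled-down tiny
# mixture, using the same diagonal/off-diagonal symmetry as before. N and M are
# assumption parameters (the post uses N = 1,000,000, M = 50).
import math

def naturality_epsilons(N, M):
    eps = 2.0 ** -M
    n_diag, n_off = 2.0 ** N, 2.0 ** (2 * N) - 2.0 ** N
    p_diag = (1 - eps) * 2.0 ** -N + eps * 2.0 ** (-2 * N)
    p_off = eps * 2.0 ** (-2 * N)
    q_diag = eps * 2.0 ** -N + (1 - eps) * 2.0 ** (-2 * N)
    q_off = (1 - eps) * 2.0 ** (-2 * N)

    def joint_entropy(d, o):
        return -(n_diag * d * math.log2(d) + n_off * o * math.log2(o))

    # Marginals of X1 and X2 are exactly uniform (N bits each) under both P and Q.
    mediation_err_Q = 2 * N - joint_entropy(q_diag, q_off)   # I(X1;X2) under Q
    redundancy_err_P = joint_entropy(p_diag, p_off) - N      # H(X1|X2) under P
    return mediation_err_Q, redundancy_err_P

# Both errors come out near N / 2^M bits, while Lambda := X1 carries N bits
# about X under P and the empty latent carries zero.
print(naturality_epsilons(N=400, M=20), 400 / 2 ** 20)
```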
Now, let’s revisit the claim we would’ve liked to make:
If the two agents’ distributions are within $\epsilon$ of each other (as measured by some KL-divergences), then their natural latents contain approximately the same information about $X$, to within some $O(\epsilon)$ bound.
Tiny mixtures rule out any claim along those lines. Generalizing the counterexample to an $N$-bit channel (where $N = 1000000$ above) and a mix-in probability of $\frac{1}{2^M}$ (where $M = 50$ above), we generally see that the two latents are natural over their respective distributions to within about $\frac{N}{2^M}$ bits, and the $D_{KL}$ between the distributions is about $M$ bits in either direction, yet one latent contains $N$ bits of information about $X$ while the other contains zero. By choosing $2^M \gg N$, with both $M$ and $N$ large, we can get arbitrarily precise natural latents over the two distributions, with the difference in the latents exponentially large with respect to the $D_{KL}$’s between the distributions.
What To Do Instead?
So the bound we’d ideally like is ruled out. What alternatives might we aim for?
Different Kind of Approximation
Looking at the counterexample, one thing which stands out is that $P$ and $Q$ are, intuitively, very different distributions. Arguably, the problem is that a “small” $D_{KL}$ just doesn’t imply that the distributions are all that close together; really, we should use some other kind of approximation.
On the other hand, $D_{KL}$ is a principled error measure with nice properties, and in particular it plugs naturally into information-theoretic or thermodynamic machinery. And indeed, we are hoping to plug all this theory into thermodynamic-style machinery down the road. For that, we need global bounds, and they need to be information-theoretic.
Additional Requirements for Natural Latents
Coming from another direction: a 50-bit update can turn $Q$ into $P$, or vice versa. So one thing this example shows is that natural latents, as they’re currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.
Interestingly, there do exist other natural latents over these two distributions which are approximately the same (under their respective distributions) as the two natural latents we used above, but more robust (in some ways) to turning one distribution into the other. In particular: we can always construct a natural latent with competitively optimal approximation via resampling. Applying that construction to $Q$, we get a latent which is usually independent random noise (which gives the same information about $X$ as the empty latent), but there’s a $\frac{1}{2^{50}}$ chance that it contains the value of $X_1$ and another $\frac{1}{2^{50}}$ chance that it contains the value of $X_2$. Similarly, we can use the resampling construction to find a natural latent for $P$, and it will have a $\frac{1}{2^{50}}$ chance of containing random noise instead of $X_1$, and an independent $\frac{1}{2^{50}}$ chance of containing random noise instead of $X_2$.
Those two latents still differ in their information content about $X$ by roughly 1 million bits, but the distribution of $X$ given each latent differs by only about 100 bits in expectation. Intuitively: while the agents still strongly disagree about the distribution of their respective latents, they agree (to within ~100 bits) on what each value of the latent says about $X$.
Does that generalize beyond this one example? We don’t know yet.
But if it turns out that the competitively optimal natural latent is generally robust to updates, in some sense, then it might make sense to add a robustness-to-updates requirement for natural latents—require that we use the “right” natural latent, in order to handle this sort of problem.
Same Distribution
A third possible approach is to formulate the theory around a single distribution $P[X]$.
For instance, we could assume that the environment follows some “true distribution”, and both agents look for latents which are approximately natural over the “true distribution” (as far as they can tell, since the agents can’t observe the whole environment distribution directly). This would probably end up with a Fristonian flavor.
ADDED July 9: The Competitively Optimal Natural Latent from Resampling Always Works (At Least Mediocrely)
Recall that, for a distribution $P[X_1, ..., X_n]$, we can always construct a competitively optimal natural latent (under strong redundancy) $X'$ by resampling each component $X_i$ conditional on the others $X_{\bar{i}}$, i.e.
$$P[X = x, X' = x'] := P[X = x] \prod_i P[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}]$$
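For concreteness, here is a rough Python sketch of that resampling construction for a small discrete joint distribution stored as an array. The function name and the array representation are my own choices for illustration, not from the post or any library.

```python
# Sketch (not from the post): the resampling construction for a small discrete
# joint distribution, represented as a numpy array with one axis per variable.
import numpy as np

def resampling_latent(P):
    """P: n-dimensional array giving P[x1,...,xn] (assumed full support).
    Returns the 2n-dimensional array
      T[x1,...,xn, x1',...,xn'] = P[x] * prod_i P[X_i = x_i' | X_{bar i} = x_{bar i}]."""
    n = P.ndim
    shape = P.shape
    T = P.reshape(shape + (1,) * n)                 # P[x], broadcast over the x' axes
    for i in range(n):
        cond = P / P.sum(axis=i, keepdims=True)     # P[X_i | X_{bar i}], axes (x1,...,xn)
        # Relabel axis i of `cond` as the x_i' axis (position n + i); keep the
        # x_{bar i} axes in place; broadcast over x_i and the other x' axes.
        new_shape = [1] * (2 * n)
        for j in range(n):
            if j != i:
                new_shape[j] = shape[j]
        new_shape[n + i] = shape[i]
        T = T * np.moveaxis(cond, i, -1).reshape(new_shape)
    return T

# Tiny-mixture example with 1-bit variables: X2 = X1 under P0, independent under Q0.
eps = 2.0 ** -6
P0 = np.array([[0.5, 0.0], [0.0, 0.5]])
Q0 = np.full((2, 2), 0.25)
T = resampling_latent((1 - eps) * P0 + eps * Q0)
print(T.shape, T.sum())    # (2, 2, 2, 2), total probability ~1.0
```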
We argued above that this specific natural latent works just fine in the tiny mixtures counterexample: roughly speaking, the resampling natural latent constructed for $P$ approximates the resampling natural latent constructed for $Q$ (to within an error comparable to how well $P$ approximates $Q$).
Now we’ll show that that generalizes. Our bound will be mediocre, but it’s any bound at all, so that’s progress.
Specifically: suppose we have two distributions over the same variables, $P[X_1, ..., X_n]$ and $Q[X_1, ..., X_n]$. We construct a competitively optimal natural latent $X'$ via resampling for each distribution:
$$P[X = x, X' = x'] := P[X = x] \prod_i P[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}]$$
$$Q[X = x, X' = x'] := Q[X = x] \prod_i Q[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}]$$
Then, we’ll use $\mathbb{E}[D_{KL}(P[X'|X] || Q[X'|X])]$ (with the expectation taken over $X$ under distribution $P$) as a measure of how well $Q$’s latent $X'$ matches $P$’s latent $X'$. Core result:
$$\mathbb{E}[D_{KL}(P[X'|X] \,||\, Q[X'|X])] \le n \, D_{KL}(P[X] \,||\, Q[X])$$
Proof:
$$\mathbb{E}[D_{KL}(P[X'|X] \,||\, Q[X'|X])] = \mathbb{E}\left[D_{KL}\left(\prod_i P[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}] \,\Big|\Big|\, \prod_i Q[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}]\right)\right]$$
$$= \sum_i \mathbb{E}[D_{KL}(P[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}] \,||\, Q[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}])]$$
(KL divergence between product distributions decomposes as a sum over components)
$$= \sum_i \mathbb{E}[D_{KL}(P[X_i \mid X_{\bar{i}}] \,||\, Q[X_i \mid X_{\bar{i}}])]$$
$$\le \sum_i \left(\mathbb{E}[D_{KL}(P[X_i \mid X_{\bar{i}}] \,||\, Q[X_i \mid X_{\bar{i}}])] + D_{KL}(P[X_{\bar{i}}] \,||\, Q[X_{\bar{i}}])\right)$$
$$= \sum_i D_{KL}(P[X] \,||\, Q[X])$$
(by the chain rule for KL divergence)
$$= n \, D_{KL}(P[X] \,||\, Q[X])$$
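As a quick numerical spot-check of this bound (my own sketch, not part of the original proof), the following code builds the resampling latent’s conditional $P[X' \mid X = x]$ directly from the product formula and verifies $\mathbb{E}[D_{KL}(P[X'|X] || Q[X'|X])] \le n \, D_{KL}(P[X] || Q[X])$ on random discrete distributions:

```python
# Sketch (not from the post): verify the n * D_KL bound on random small joints.
import numpy as np

rng = np.random.default_rng(0)

def conditional_resample_given_x(J, x):
    """Distribution of X' given X = x under the resampling construction:
    prod_i J[X_i = x'_i | X_{bar i} = x_{bar i}], returned as an n-dim array over x'."""
    n = J.ndim
    dist = np.ones(())
    for i in range(n):
        idx = list(x)
        idx[i] = slice(None)                 # free the i-th coordinate
        slice_i = J[tuple(idx)]              # J[X_i = . , X_{bar i} = x_{bar i}]
        cond_i = slice_i / slice_i.sum()     # J[X_i = x'_i | X_{bar i} = x_{bar i}]
        shape = [1] * n
        shape[i] = len(cond_i)
        dist = dist * cond_i.reshape(shape)  # outer product across components
    return dist

def check_bound(n=3, k=2):
    shape = (k,) * n
    P = rng.random(shape); P /= P.sum()
    Q = rng.random(shape); Q /= Q.sum()
    kl_PQ = np.sum(P * np.log2(P / Q))
    expected_latent_kl = 0.0
    for x in np.ndindex(shape):
        p_lat = conditional_resample_given_x(P, x)
        q_lat = conditional_resample_given_x(Q, x)
        expected_latent_kl += P[x] * np.sum(p_lat * np.log2(p_lat / q_lat))
    return expected_latent_kl, n * kl_PQ

for _ in range(5):
    lhs, rhs = check_bound()
    assert lhs <= rhs + 1e-9
    print(f"{lhs:.4f} <= {rhs:.4f}")
```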
So we have a bound. Unfortunately, the factor of $n$ (number of variables) makes the bound kinda mediocre. We could sidestep that problem in practice by just using natural latents over a small number of variables at any given time (which is actually fine for many, and arguably most, use cases). But based on the proof, it seems like we should be able to improve a lot on that factor of $n$: we outright add $\sum_i D_{KL}(P[X_{\bar{i}}] \,||\, Q[X_{\bar{i}}])$, which should typically be much larger than the quantity we’re trying to bound.