This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at leastthesefourposts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.
There’s various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:
The “features” are sums over local terms, and
The high-level variables are (isomorphic to) the Lagrange multipliers
More formally: we have a low-level causal model (aka Bayes net) P[XL]=∏iP[XLi|XLpa(i)]. Given the high-level variables XH, the distribution of low-level variable values should look like
P[XL|XH]=1ZP[XL]eλT(XH)∑ifi(XLi,XLpa(i))
… i.e. the maximum-entropy distribution subject to constraints of the form E[∑ifi(XLi,XLpa(i))|XH]=μ(XH). (Note: λ, fi, and μ are all vector-valued.)
I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.
Maxent Telephone Argument
Quick recap of the Telephone Theorem: information about some variable X passes through a nested sequence of Markov blankets M1,M2,…. Information about X can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit P[X|Mn]=P[X|Fn(Mn)] for some F such that Fn(Mn)=Fn+1(Mn+1) with probability approaching 1 as n→∞; F is the perfectly-conserved-in-the-limit information carrier.
In this setup, we can also argue that the limiting distribution limn→∞P[X|Mn] should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)
Think about how the distribution (x↦P[X=x|Mn]) transforms as we increment n by 1. We have
P[X|Mn+1]=∑MnP[X|Mn]P[Mn|Mn+1]
First key property of this transformation: it’s a convex combination for each Mn+1 value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same. So, the entropy of P[X|Mn] will not decrease with n.
When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form ∑Xf(X)P[X|Mn] for some f, i.e. they’re expected values. They’re conserved when E[f(X)|Mn]=E[f(X)|Mn+1] with probability 1.
Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution P[X|Mn] to approach maximum entropy subject to constraints of the form E[f(X)|Mn]=μ(Mn), where E[f(X)|Mn]=E[f(X)|Mn+1] with probability 1 (at least in the limit of large n). Thus, we have the maxent form
P[X|Mn]=1ZP[X]eλT(Mn)f(X)
(Note on the P[X] in there: I’m actually maximizing relative entropy, relative to the prior on X, which is almost always what one should actually do when maximizing entropy. That results in a P[X] term. We should find that E[lnP[X]|Mn] is a conserved quantity anyway, so it shouldn’t actually matter whether we include the P[X] multiplier or not; we’ll get the same answer either way.)
Shortcomings of This Argument
Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved f’s being a sum over local terms, i.e. f(X)=∑ifi(Xi,Xpa(i)).
Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.
Resampling + gKPD Argument
Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system—i.e. the natural abstractions—is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process X0, and the final state X. Then we have
… where F is conserved by the resampling process with probability 1.
It turns out that P[X|X0] factors over the same DAG as the underlying causal model:
P[X|X0]=∏iP[Xi|Xpa(i),X0]
If the conserved quantities F(X) are much lower-dimensional than X itself, then we can apply the gKPD theorem: we have a factorization of P[X|X0], we have a low-dimensional summary statistic F(X) which summarizes all the info in X relevant to X0, so the gKPD theorem says that the distribution must have the form
… where E is a relatively-small set of “exceptional” indices, and (X0)∗ is some fixed reference value of X0. This is slightly different from our intended form—there’s the exception terms, and we have ∏i∉EP[Xi|Xpa(i),X0=(X0)∗] rather than just ∏i∉EP[Xi|Xpa(i)]. The latter problem is easily fixed by absorbing ∏i∉EP[Xi|Xpa(i),X0=(X0)∗]P[Xi|Xpa(i)] into f (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:
P[X|X0]=1ZeλT(X0)∑ifi(Xi,Xpa(i))P[X]
Note that this is maxentropic subject to constraints of the form E[∑ifi(Xi,Xpa(i))|X0]=μ(X0). Since the summary statistic F(X)=∑ifi(Xi,Xpa(i)) is conserved by the resampling process, we must have μ(X0)=∑ifi(X0i,X0pa(i)), so the conservation equation is
E[∑ifi(Xi,Xpa(i))|X0]=∑ifi(X0i,X0pa(i))
Shortcomings of This Argument
Obviously there’s the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.
That said, resampling + gKPD does give us basically the result we want, at least for redundancy/resampling-based natural abstractions.
Maxent and Abstractions: Current Best Arguments
This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.
There’s various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:
The “features” are sums over local terms, and
The high-level variables are (isomorphic to) the Lagrange multipliers
More formally: we have a low-level causal model (aka Bayes net) P[XL]=∏iP[XLi|XLpa(i)]. Given the high-level variables XH, the distribution of low-level variable values should look like
P[XL|XH]=1ZP[XL]eλT(XH)∑ifi(XLi,XLpa(i))
… i.e. the maximum-entropy distribution subject to constraints of the form E[∑ifi(XLi,XLpa(i))|XH]=μ(XH). (Note: λ, fi, and μ are all vector-valued.)
This is the sort of form we see in statistical mechanics. It’s also the form which the generalized Koopman-Pitman-Darmois (gKPD) theorem seems to hint at.
I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.
Maxent Telephone Argument
Quick recap of the Telephone Theorem: information about some variable X passes through a nested sequence of Markov blankets M1,M2,…. Information about X can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit P[X|Mn]=P[X|Fn(Mn)] for some F such that Fn(Mn)=Fn+1(Mn+1) with probability approaching 1 as n→∞; F is the perfectly-conserved-in-the-limit information carrier.
In this setup, we can also argue that the limiting distribution limn→∞P[X|Mn] should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)
Think about how the distribution (x↦P[X=x|Mn]) transforms as we increment n by 1. We have
P[X|Mn+1]=∑MnP[X|Mn]P[Mn|Mn+1]
First key property of this transformation: it’s a convex combination for each Mn+1 value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same. So, the entropy of P[X|Mn] will not decrease with n.
When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form ∑Xf(X)P[X|Mn] for some f, i.e. they’re expected values. They’re conserved when E[f(X)|Mn]=E[f(X)|Mn+1] with probability 1.
Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution P[X|Mn] to approach maximum entropy subject to constraints of the form E[f(X)|Mn]=μ(Mn), where E[f(X)|Mn]=E[f(X)|Mn+1] with probability 1 (at least in the limit of large n). Thus, we have the maxent form
P[X|Mn]=1ZP[X]eλT(Mn)f(X)
(Note on the P[X] in there: I’m actually maximizing relative entropy, relative to the prior on X, which is almost always what one should actually do when maximizing entropy. That results in a P[X] term. We should find that E[lnP[X]|Mn] is a conserved quantity anyway, so it shouldn’t actually matter whether we include the P[X] multiplier or not; we’ll get the same answer either way.)
Shortcomings of This Argument
Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved f’s being a sum over local terms, i.e. f(X)=∑ifi(Xi,Xpa(i)).
Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.
Resampling + gKPD Argument
Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system—i.e. the natural abstractions—is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process X0, and the final state X. Then we have
P[X|X0]=P[X|F(X0)]=P[X|F(X)]P[F(X)|F(X0)]=1ZP[X]I[F(X)=F(X0)]
… where F is conserved by the resampling process with probability 1.
It turns out that P[X|X0] factors over the same DAG as the underlying causal model:
P[X|X0]=∏iP[Xi|Xpa(i),X0]
If the conserved quantities F(X) are much lower-dimensional than X itself, then we can apply the gKPD theorem: we have a factorization of P[X|X0], we have a low-dimensional summary statistic F(X) which summarizes all the info in X relevant to X0, so the gKPD theorem says that the distribution must have the form
P[X|X0]=1ZeλT(X0)∑i∉Efi(Xi,Xpa(i))∏i∉EP[Xi|Xpa(i),X0=(X0)∗]∏i∈EP[Xi|Xpa(i),X0=X0]
… where E is a relatively-small set of “exceptional” indices, and (X0)∗ is some fixed reference value of X0. This is slightly different from our intended form—there’s the exception terms, and we have ∏i∉EP[Xi|Xpa(i),X0=(X0)∗] rather than just ∏i∉EP[Xi|Xpa(i)]. The latter problem is easily fixed by absorbing ∏i∉EP[Xi|Xpa(i),X0=(X0)∗]P[Xi|Xpa(i)] into f (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:
P[X|X0]=1ZeλT(X0)∑ifi(Xi,Xpa(i))P[X]
Note that this is maxentropic subject to constraints of the form E[∑ifi(Xi,Xpa(i))|X0]=μ(X0). Since the summary statistic F(X)=∑ifi(Xi,Xpa(i)) is conserved by the resampling process, we must have μ(X0)=∑ifi(X0i,X0pa(i)), so the conservation equation is
E[∑ifi(Xi,Xpa(i))|X0]=∑ifi(X0i,X0pa(i))
Shortcomings of This Argument
Obviously there’s the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.
That said, resampling + gKPD does give us basically the result we want, at least for redundancy/resampling-based natural abstractions.