Maxent and Abstractions: Current Best Arguments
This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.
There’s various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:
The “features” are sums over local terms, and
The high-level variables are (isomorphic to) the Lagrange multipliers
More formally: we have a low-level causal model (aka Bayes net) . Given the high-level variables , the distribution of low-level variable values should look like
… i.e. the maximum-entropy distribution subject to constraints of the form . (Note: , , and are all vector-valued.)
This is the sort of form we see in statistical mechanics. It’s also the form which the generalized Koopman-Pitman-Darmois (gKPD) theorem seems to hint at.
I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.
Maxent Telephone Argument
Quick recap of the Telephone Theorem: information about some variable passes through a nested sequence of Markov blankets . Information about can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit for some such that with probability approaching 1 as ; is the perfectly-conserved-in-the-limit information carrier.
In this setup, we can also argue that the limiting distribution should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)
Think about how the distribution transforms as we increment by 1. We have
First key property of this transformation: it’s a convex combination for each value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same. So, the entropy of will not decrease with .
When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form for some , i.e. they’re expected values. They’re conserved when with probability 1.
Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution to approach maximum entropy subject to constraints of the form , where with probability 1 (at least in the limit of large ). Thus, we have the maxent form
(Note on the in there: I’m actually maximizing relative entropy, relative to the prior on , which is almost always what one should actually do when maximizing entropy. That results in a term. We should find that is a conserved quantity anyway, so it shouldn’t actually matter whether we include the multiplier or not; we’ll get the same answer either way.)
Shortcomings of This Argument
Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved ’s being a sum over local terms, i.e. .
Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.
Resampling + gKPD Argument
Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system—i.e. the natural abstractions—is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process , and the final state . Then we have
… where is conserved by the resampling process with probability 1.
It turns out that factors over the same DAG as the underlying causal model:
If the conserved quantities are much lower-dimensional than itself, then we can apply the gKPD theorem: we have a factorization of , we have a low-dimensional summary statistic which summarizes all the info in relevant to , so the gKPD theorem says that the distribution must have the form
… where is a relatively-small set of “exceptional” indices, and is some fixed reference value of . This is slightly different from our intended form—there’s the exception terms, and we have rather than just . The latter problem is easily fixed by absorbing into (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:
Note that this is maxentropic subject to constraints of the form . Since the summary statistic is conserved by the resampling process, we must have , so the conservation equation is
Shortcomings of This Argument
Obviously there’s the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.
That said, resampling + gKPD does give us basically the result we want, at least for redundancy/resampling-based natural abstractions.
- Natural Abstractions: Key claims, Theorems, and Critiques by 16 Mar 2023 16:37 UTC; 228 points) (
- Research agenda: Formalizing abstractions of computations by 2 Feb 2023 4:29 UTC; 92 points) (
- The Lightcone Theorem: A Better Foundation For Natural Abstraction? by 15 May 2023 2:24 UTC; 69 points) (
- The Pragmascope Idea by 4 Aug 2022 21:52 UTC; 59 points) (
- [Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques by 16 Mar 2023 16:38 UTC; 48 points) (
- World-Model Interpretability Is All We Need by 14 Jan 2023 19:37 UTC; 35 points) (
- Abstraction As Symmetry and Other Thoughts by 1 Feb 2023 6:25 UTC; 28 points) (
- AI Risk Intro 2: Solving The Problem by 22 Sep 2022 13:55 UTC; 22 points) (
- AI Risk Intro 2: Solving The Problem by 24 Sep 2022 9:33 UTC; 11 points) (EA Forum;
- 9 Jun 2022 22:12 UTC; 2 points) 's comment on How Do Selection Theorems Relate To Interpretability? by (
- 21 Nov 2022 22:37 UTC; 2 points) 's comment on The Pragmascope Idea by (
I followed approximately the technical discussion, and now I’m wondering what that would buy us if you are correct.
Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, derivation for free? (Basically applying a productive frame to the problem of abstraction)
It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I’m guessing that this is the hard part in computing the max entropy distribution (?)
Are these correct, and what am I missing?
That’s basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.
One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there’s probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.