Abstractions as Redundant Information
What is it that makes the concept of “pencil” a good abstraction?
One way to frame it: there are lots of “pencils” in the world—lots of little chunks of the world which I could look at which all contain information about pencils-in-general. Many different places from which I could gain information about “pencils”, and many different places where I can apply that information to make predictions. If I don’t know that pencils are usually made of wood with a graphite core, there are many different chunks of the world which I could look at to gain that information. If I do know that pencils are usually made of wood with a graphite core, there are many different chunks of the world where I could apply that information to predict the materials from which something is made.
Alternatively, consider a gear, spinning in a gearbox. What makes the gear’s rotational speed a good abstraction? Well, there are lots of little patches of metal comprising the gear. If I don’t know the gear’s rotation speed, I could precisely estimate it from any of the little patches. If I do know the gear’s rotation speed, I can use it to predict the rotation of many different little patches of metal within the gear.
This seems like an intuitively-reasonable notion of what makes abstractions “good” in general: there are many different places from which we can learn the information, and many different places where we can apply the information to make predictions. In other words, a good abstraction is highly redundant: it appears in many different places, and in any of those places we can use the abstract information to make predictions and/or gain more information about the abstraction in general.
In this post I’ll sketch out one way to formalize this idea mathematically, and show that it’s equivalent to the formalization of abstraction as information-at-a-distance. In particular, in the gear example: conditional on the highly redundant information (i.e. the overall rotation of the gear), the low-level rattling of far-apart chunks of metal is statistically independent. More generally, redundancy yields the same high-level abstract information as the Telephone Theorem, but in a mathematically-cleaner form, and without the need for different summaries between different variables.
Meta
Advice for readers: I try to keep the more-dense math in a few specific sections and the appendices, which can all be skimmed/skipped if you just want a conceptual picture.
Epistemic status: I’m aiming for physics-level rigor here. The proofs involve some shenanigans with infinite limits, and I expect them to contain subtle errors. However, I also expect that the results will accurately predict reality in practice, and that whatever subtle errors the proofs contain can be patched by the sort of mathematician who enjoys dealing with tricky limit shenanigans.
Thankyou to Rob Miles and Adam Shimi for suggesting I try Excalidraw.
Basic Idea: Redundant Information Is Conserved Under Resampling
Suppose I sample the genomes of two random humans, and What information is redundant across these two random variables?
One intuitively-reasonable answer: if I threw away one of the two sequences, then the redundant information is whatever I could still figure out from the other. More generally, if I have a whole bunch of genomes from random humans, , the redundant information is whatever I could still figure out after throwing one of them away.
To formalize “what I could still figure out after throwing one away”, we’ll use the idea of resampling: I throw away the value of , and then sample a new value for , using my original probability distribution conditioned on all the other genomes. So, for instance, I throw away , then I look at all the other genomes and see that in most places they’re the same—so when I sample my new I know that it should match all the other genomes in all those places. However, there are a few locations where the other genomes differ, e.g. maybe 10% of them contain a particular mutation. So, when I sample my new , it will contain that mutation with roughly 10% probability (assuming there’s enough data to swamp the impact of my priors).
Now I repeat this process many times, each time throwing away one randomly-chosen genome and resampling it conditional on the others. Intuitively, I expect that the information conserved by this process will be whatever information is highly redundant, so that approximately-zero loss occurs at each step.
General method:
Start with a bunch of random variables , a joint distribution , and a value for each variable. (See the Equations section below for more details on the notation.)
At each “timestep”, pick one variable at random, throw away its value, and resample it conditional on all the other variable values.
Run this process for a long time.
See what information about the initial conditions is conserved.
Conceptual Example: Gear
Suppose I have a gear, spinning around in a gearbox. I look at a few nanometer-size patches on the gear’s surface, just a hundred-ish atoms each, and measure the (approximate) position and velocity of each atom in each little patch. Notation: random variable gives the positions and velocities of each atom in patch .
What information is redundant across these different patches? Intuitively, I can look at any patch and average together the rotational velocities of the atoms about the gear’s center to get a reasonably-precise estimate of the gear’s overall rotation speed. If I throw away , I can still precisely estimate the gear’s overall rotation from the other ’s. Then, when I resample , I will resample it so that the average rotational velocity of the atoms in matches the gear’s overall rotation.
Equations
Notation: we’ll use for the variable-values after t steps of this process, and for variable-values after running the process for an arbitrarily long time. So, we’re mainly interested in the mutual information between and . The resampling process itself specifies the distribution .
We’ll use “model” variables and to distinguish probabilities from the “base” distribution (which is only over ) vs probabilities from the resampling process (which is over ). One potential point of confusion: both models contain a variable called “”, but these two variables are “in different scopes” (in the programmer’s sense); “” means something different depending on which model it’s in. In the base model, is just a single instance of our base random variables. In the resampling model, consists of many instances , one for each timestep , and each individual instance is distributed the same as the base model’s . We use the same variable name for both because there’s a conceptual correspondence between the two.
The resampling process defines the full relationship between the two models:
(assuming variable i is resampled at resampling-time ; for all the other variables, , since their values just stay the same)
These equations follow directly from the process outlined above, and define the distribution in terms of the distribution .
… So We’re Running MCMC?
If you’ve worked with Markov Chain Monte Carlo (MCMC) algorithms before, this should look familiar: we’re basically asking which information about the initial conditions will be conserved as we run a standard MCMC process.
If you’ve worked with MCMC algorithms before, you might also guess that this question has two answers:
In theory, assuming everywhere, the resampling-distribution always approaches as we run the process regardless of initial conditions. Since we always approach the same distribution regardless of initial conditions, no information about the initial conditions is conserved.
In practice, the time it takes for that limit to kick in increases (often exponentially) with the number of variables in the model, so lots of information is conserved in large models over any practically-achievable number of steps of the process.
In this post, we’re going to think about infinitely large models, so the process no longer converges to at all; that convergence time goes to infinity. The distribution does still converge, but the limiting distribution depends on the initial conditions, and that dependence is exactly what we’re interested in. Unfortunately, this introduces a bunch of tricky subtleties about how we take the limits: do we take the limit of infinitely many variables first, or the limit of infinite time first, or do we take them at the same time with some fixed relationship between the two? I’ll handle those subtleties mainly by ignoring them and hoping a mathematician comes along to clean it up later. Remember, we’re aiming for physics-level rigor here.
The important takeaway of this section is that we have a ton of data on how these sorts of processes actually behave in practice, thanks to the popularity of MCMC algorithms. So we don’t just have to rely on physics-level-of-rigor arguments; anyone with firsthand experience with MCMC on large models can use their intuition as a guide. (I mostly won’t explicitly talk more here about lessons from MCMC; I expect that those of you already familiar with the topic can reason it through for yourselves, and explaining the relevant experience/intuitions to people not already familiar is beyond the scope of this post.)
Worked Examples
Two Variables
Let’s start with a trivial example to show how any information at all can be conserved by the resampling process. We have two random variables, and . Our model for the two variable values is:
is an unbiased coin flip
is a record on a piece of paper indicating whether the coin came up heads or tails (“H” if the coin came up heads, “T” if the coin came up tails)
What happens when we run our resampling process on this system? Well, first we throw away the coin flip, and resample it given our record. If the record says “H”, then we know the coin came up heads, so our sampler selects heads again for ; vice-versa for tails. The first coin is therefore reset to its original value. Then, we throw away the record, and resample it given our coin. If the coin is heads, we know the record says “H”; vice-versa for tails. The record is therefore reset to its original value.
The process continues, back and forth, with each variable “storing the information” when the other is thrown away, and the information then perfectly copied back over into the resampled variable value.
On the other hand, imagine that our record-keeping is imperfect—maybe there’s a 10% chance that records the wrong value. Then, at each “timestep” of the resampling process, there’s roughly a 20% chance (10% for each variable) that we’ll lose the original value. Given, say, 100 timesteps, we’ll lose approximately-all of the information about the original values.
General point: only information which is perfectly conserved at each timestep will be conserved indefinitely; everything else is completely lost.
In general, the information perfectly conserved indefinitely will be the value of deterministic constraints, i.e. functions and such that with probability 1 in the base distribution . (We can prove this via the Telephone Theorem plus an equilibrium condition, but it’s not the main theorem of interest in this post.)
Many Measurements Of One Thing
Let’s say we have a stick of length , and take conditionally-independent measurements of . We’ll model each measurement as normally distributed with mean and standard deviation , and we’ll assume that we have enough data that the prior on doesn’t matter much (i.e. we’ll just pretend the prior is flat).
To simplify the analysis a little, we’ll resample the variables in order rather than at random—i.e. we resample conditional on all the measurements, then resample each of the measurements conditional on , and repeat.
When we resample , we draw from a normal distribution with mean equal to the average of the measurements (i.e. ) and standard deviation , so our new will be about away from the previous average. When we resample the measurements, their new average will be normally distributed with mean and standard deviation , so the new average will be about away from the previous . In other words: and the measurement average follow a random walk, drifting about per timestep. Over timesteps, they will drift a distance of about . (In general, the distance a random walk drifts scales with rather than , since it often wanders back on itself.)
So: if we run the process for a number of steps , then all information about the initial conditions is lost. On the other hand, if the number of variables , then the drift is close to zero, so and the measurement average are approximately conserved. The order in which we take our limits matters.
Practically speaking, we’re looking for information which is approximately conserved, i.e. the “timescale” over which it’s lost is large. So it makes sense to consider and the measurement average as approximately-conserved when is large. That’s our abstract information in this example.
Factorization
Now for our first theorem about this kind of resampling process.
Imagine that our base distribution is local—i.e. each variable only “directly interacts with” a few “neighbor variables”. When modeling a physical gear, for instance, each little chunk of metal only interacts directly with the chunks of metal spatially adjacent. Any longer-range interactions have to “go through” those direct interactions.
Our theorem says that interactions are still local after controlling for the information conserved by the resampling process. In the gear example, after controlling for the high-level rotation of the gear, the remaining low-level vibrations and rattling are still local; the low-level details of each chunk of metal interact directly only with the low-level details in chunks spatially adjacent.
Why does this matter? In general, locality is the main tool which lets us reason about our high-dimensional world at all. It means we can look at one part of the world, and understand what’s going on there without having to understand everything that’s happening in the whole universe. The factorization theorem says that this still applies when we condition on our high-level knowledge—in other words, we can “zoom in” on lower-level details, and add them to our high-level picture without having to understand all the other low-level details in the whole universe. Conditional on our high-level knowledge, any low-level information still has to flow through neighboring variables in order to influence things “far away” in the graph.
That’s going to be key to our next theorem, which is the main item of interest in this post.
Formal Statement
Resampler Conserves Locality: If the base distribution factors over some graph , then so does the limiting resampling distribution . This factorization theorem applies to both undirected graphical models (i.e. Markov Random Fields) and directed graphical models (i.e. Bayes Nets/Causal Models). See the appendices for a proof sketch.
In the gear example, the graph would be the adjacency graph for chunks of metal: each chunk is a node, and the edges show spatial adjacency. The factorization follows the standard formulas for factorization of Markov Random Fields or Bayes Nets, depending on which type of graphical model we’re using.
The Interesting Part: Resampler-Conserved Quantities Mediate Information At A Distance
Time for the big claim from earlier: in the gear example, conditional on the highly redundant information (i.e. the overall rotation of the gear), the low-level rattling of far-apart chunks of metal is statistically independent.
More generally: assume that our base distribution factors on a graph . Conditional on all the quantities perfectly conserved by the resampling process, variables far apart in are independent. If you’ve read the Telephone Theorem, this is basically the same, but with one big upgrade: our “high-level summary” no longer depends on which notion of “far away” we use; the same summary applies to any sequence of nonoverlapping nested Markov blankets. We can take the information conserved by the resampling process to be the “high-level abstractions” for the whole model.
Let’s unpack that. First, we’ll do a quick recap of the Telephone Theorem.
We start with a large Causal Model/Bayes Net:
Each node is a random variable, and the arrows show direct causal influence; paths show indirect causal influence. If you’ve used MCMC before, you were probably picturing something like this already; we usually use MCMC with Bayes nets, because the locality structure makes it easy to resample variables (we only need to look at neighbor values rather than the values of all variables).
Now, just like in the Telephone Theorem, we picture a sequence of nested Markov blankets in our model:
Each possible sequence of blankets cuts the graph into pieces, with each piece only connected directly to the piece before and the piece after. A choice of sequence of blankets defines a notion of “far away”—i.e. if two variables are separated by a large number of “layers” of blankets in the sequence, then they are “far apart”. In order for to have any mutual information with , that information must propagate through each of the layers in between.
The basic idea of the Telephone Theorem is that information is either perfectly conserved or completely lost as we move through enough layers; information can only propagate “far away” if the information can be perfectly computed from each layer individually.
… but if some information can be perfectly computed from each layer individually, then that information will be conserved by our resampler. When resampling , I can still perfectly calculate the information from or , and therefore the information will be perfectly conserved. So, the only information which can propagate far away is information which is perfectly conserved by resampling. That means that conditional on the information conserved by resampling, the mutual information must drop to zero.
A slightly more formal version of that argument is in the appendices, but that’s the core idea.
Formal Statement
Let satisfy , i.e. encodes the values of all conserved quantities in the resampling process. For any infinite sequence of nested nonoverlapping Markov blankets on the base model, the conditional mutual information as .
Gear Example
In the gear example, our Markov blankets might be nested layers of metal:
“Far away” then indicates moving through a large number of such layers.
In order for information to propagate far away, we must be able to compute it from each layer—i.e. we can very precisely compute the overall rotation of the gear from the overall rotation of chunks of metal in each layer. So, when we resample a chunk of metal, the overall rotation will be conserved—it will be “stored in” the other chunks.
This reasoning must apply to any information which propagates through many layers: if the information propagates far away, it will be conserved by resampling. So, assuming the overall rotation is the only quantity conserved by the resampling process, far-apart chunks of the gear must be independent given the overall rotation.
Intuitively, this lets us factor the gear into a “global” component and a “local” component. The global component, the gear’s overall rotation, is redundantly represented; it can be estimated by looking at many different little chunks, and we expect these estimates to (approximately) agree. The local component captures everything else, and is guaranteed to be “local” in the sense that far apart pieces are independent.
Conclusion
We started with the intuition that abstraction is about redundant information: there are many different places from which we can learn the information, and many different places where we can apply the information to make predictions. That’s what makes abstractions generalizable and useful.
Then, we showed that a formalization of this intuition based on resampling variables reproduces the main ideas of abstraction as information-relevant-at-a-distance. In particular, the resampling approach yields a better version of the Telephone Theorem.
Personally, I came to all this in a different order: I noticed that the Telephone Theorem required redundancy of information, figured out the resampling thing, and only backed out the intuition of abstraction-as-redundant-information later. Nonetheless, it was an exciting thing to find: when different intuitive formulations of an idea turn out to be basically-equivalent, that’s strong evidence that we’re on the right track.
In terms of applications, the locality results in particular are exactly the sort of thing which I’ve been looking for since my last update on testing the Natural Abstraction Hypothesis. In combination with the generalized Koopman-Pitman-Darmois theorem, they get us to the point where calculating abstractions from a base model is roughly-tractable, i.e. it can be done in something like time with respect to the size of the base model. That still isn’t quite efficient enough to handle big models, and unfortunately big models are exactly where we expect to see nontrivial results, so I’m still not quite at the point of running empirical tests. But it feels like the hard part is now past; there are some clear steps forward on the math, and I expect that those will basically close the gap to efficient calculation.
On the theoretical side, the first leg of the Natural Abstraction Hypothesis says:
For most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
Assuming the proofs in this post basically hold up, and the loopholes aren’t critical, I think this claim is now basically proven. There’s still some operationalization to be done (e.g. the “dimension” of the summary hasn’t actually been addressed yet), loopholes to close (e.g. deterministic computation makes things tricky), some legwork to flesh it all out (e.g. including numerical approximation), various extensions (e.g. logical uncertainty), and a lot of distillation needed, but I think this math is enough to conclude that the basic claim is probably true in worlds following local laws (like ours).
The basic idea is also useful even in nonlocal models, which I’ll hopefully write about in the not-too-distant future. That form is more readily applicable to clustering-style applications, like e.g. recognizing “pencils” as a kind of object.
Appendices
Proof Sketch: Factorization
We’ll use three facts. First, is invariant under resampling any variable—i.e. the distribution reaches an equilibrium as we run the resampling process. I won’t actually prove that, because I don’t expect that there’s anything new involved; standard MCMC convergence proofs should largely carry over (allowing for conserved quantities, of course). Formally:
as
Second, the resampler is local:
… and
… so
… where indicates the neighbors of in the graph on which factors.
Third, screens off from (thus the “Markov Chain” part of “Markov Chain Monte Carlo). So, .
Combining these three, we find
as
… i.e. the neighbors which screen off from everything else in the base model also screen off from everything else in , given . The Hammersley-Clifford Theorem tells us that this is a sufficient condition for to factor over the graph (modulo taking some limits).
Note that the Hammersley-Clifford theorem applies to undirected graphical models. I won’t prove the extension of our theorem to Bayes Nets here because, again, there’s nothing particularly novel involved.
Proof Sketch: Resampler Telephone Theorem
Once we have locality of given , this theorem is easy: it’s exactly like the Telephone Theorem, but on the distribution rather than . Because factors over the same graph as , we can pick any sequence of nonoverlapping nested Markov blankets in the base model, carry them over directly to , and apply the Telephone Theorem argument:
The relationship between and is mediated by , so . In other words, mutual information with B_1 decreases as we move outward through the layers.
MI is nonnegative and decreasing, so it must approach a limit as .
Once the MI is arbitrarily close to that limit, information about is arbitrarily perfectly conserved.
Information is perfectly conserved only when the information is carried by a deterministic constraint, i.e. where with probability 1 for some .
The key thing to notice here is the deterministic constraint . In the two-variable example, we saw that exactly this kind of information is conserved by our resampling process: when we resample variables in , the constraint value is perfectly “remembered” by , and when we resamples variables in , the constraint value is perfectly “remembered” by . Newly-generated sample values are forced to be perfectly consistent with the constraint value, so that information sticks around.
So, if with probability 1, and the blankets and are nonoverlapping, then the value is perfectly conserved when resampling. It’s perfectly conserved by the variables in when resampling a variable outside of , and perfectly conserved by the variables in when resampling a variable outside of . So, conditional on information conserved by the resampling process (or, equivalently for purposes of the distribution, conditional on ), the value of is known; it does not give any information at all. So, conditional on quantities conserved by resampling, the mutual information must drop to zero in the limit of large .
- Natural Abstractions: Key claims, Theorems, and Critiques by 16 Mar 2023 16:37 UTC; 237 points) (
- Call For Distillers by 4 Apr 2022 18:25 UTC; 206 points) (
- Research agenda: Formalizing abstractions of computations by 2 Feb 2023 4:29 UTC; 93 points) (
- Call For Distillers by 6 Apr 2022 3:03 UTC; 70 points) (EA Forum;
- The Lightcone Theorem: A Better Foundation For Natural Abstraction? by 15 May 2023 2:24 UTC; 69 points) (
- Agency As a Natural Abstraction by 13 May 2022 18:02 UTC; 55 points) (
- The “Minimal Latents” Approach to Natural Abstractions by 20 Dec 2022 1:22 UTC; 53 points) (
- 2022 (and All Time) Posts by Pingback Count by 16 Dec 2023 21:17 UTC; 53 points) (
- Good ontologies induce commutative diagrams by 9 Oct 2022 0:06 UTC; 49 points) (
- [Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques by 16 Mar 2023 16:38 UTC; 48 points) (
- [Intro to brain-like-AGI safety] 14. Controlled AGI by 11 May 2022 13:17 UTC; 45 points) (
- Scarce Channels and Abstraction Coupling by 28 Feb 2023 23:26 UTC; 41 points) (
- 4 Apr 2022 21:19 UTC; 41 points) 's comment on Call For Distillers by (
- World-Model Interpretability Is All We Need by 14 Jan 2023 19:37 UTC; 35 points) (
- What Does The Natural Abstraction Framework Say About ELK? by 15 Feb 2022 2:27 UTC; 35 points) (
- The Big Picture Of Alignment (Talk Part 2) by 25 Feb 2022 2:53 UTC; 34 points) (
- Maxent and Abstractions: Current Best Arguments by 18 May 2022 19:54 UTC; 33 points) (
- Abstraction As Symmetry and Other Thoughts by 1 Feb 2023 6:25 UTC; 28 points) (
- Towards Gears-Level Understanding of Agency by 16 Jun 2022 22:00 UTC; 25 points) (
- AI Risk Intro 2: Solving The Problem by 22 Sep 2022 13:55 UTC; 22 points) (
- Towards the Operationalization of Philosophy & Wisdom by 28 Oct 2024 19:45 UTC; 20 points) (
- AI Risk Intro 2: Solving The Problem by 24 Sep 2022 9:33 UTC; 11 points) (EA Forum;
- Investigating Basin Volume with XOR Networks by 10 Mar 2024 1:35 UTC; 10 points) (
- 5 Aug 2022 15:38 UTC; 8 points) 's comment on The Pragmascope Idea by (
- 11 May 2022 17:03 UTC; 7 points) 's comment on [Intro to brain-like-AGI safety] 14. Controlled AGI by (
- 11 May 2022 21:21 UTC; 4 points) 's comment on [Intro to brain-like-AGI safety] 14. Controlled AGI by (
- 15 Feb 2023 17:31 UTC; 3 points) 's comment on Abstraction As Symmetry and Other Thoughts by (
- Towards the Operationalization of Philosophy & Wisdom by 28 Oct 2024 19:45 UTC; 1 point) (EA Forum;
(Warning: I might not be describing this well. And might be stupid.)
I feel like there’s an alternate perspective compared to what you’re doing, and I’m trying to understand why you don’t take that path. Something like: You’re taking the perspective that there is one Bayes net that we want to understand. And the alternative perspective is: there is a succession of “scenarios”, and each is its own Bayes net, and there are deep regularities connecting them—for example they all obey the laws of physics, there’s some relation between the “chairs” in one scenario and future scenarios, etc.
In the “multiple scenarios” perspective, we can frame everything in terms of building good generative models, that predict missing data from limited data, or predict the future from the past. It seems like “resampling” is unnecessary in this perspective; we can evaluate generative models by whether they work well in the next scenario. Or as another example, we would learn a generative model of a gear that involved rigid rotation plus small-amplitude vibration, and when we see something that seems to match that model, we would guess that the model is applicable here. Etc.
Then a claim about far-away information would look something like “the generative model is structured as a bunch of sub-items with coordinates, and these items have local interactions”. And then you can make a substantive claim: “This class of generative models is a very powerful model class, really good at making predictions given a fixed model complexity”. Is that claim true? In some senses yes, but we can also immediately see limitations, e.g. “it is nighttime” is a useful generative model ingredient that doesn’t really have coordinates, and likewise “illegal”, etc.
The “redundant information” framing still applies; if a comparatively simple generative model can make lots of correct predictions, then clearly there was redundant information in what was predicted.
Anyway, my question is: am I correct that we can define abstractions as “ingredients in generative models”, and if so, why don’t you like that approach? (Or is it equivalent to your approach??)
You can think of everything I’m doing as occurring in a “God’s eye” model. I expect that an agent embedded in this God’s-eye model will only be able to usefully measure natural abstractions within the model. So, shifting to the agent’s perspective, we could say “holding these abstractions fixed, what possible models are compatible with them?”. And that is indeed a direction I plan to go. But first, I want to get the nicest math I possibly can for computing the abstractions within a model, because the cleaner that is the cleaner I expect that computing possible models from the abstractions will be.
… that was kinda rambly, but I guess the summary is “building good generative problems is the inverse problem for the approach I’m currently focused on, and I expect that the cleaner this problem is solved the easier it will be to handle its inverse problem”.
This seems similar to the SR model of scientific explanation.
This post reminds me of a self-supervised machine learning model I wanted to make a while ago, but I never got around to.
Essentially the idea was to split the data up into a “small connected region” vs “everything else”. For instance, if the data was images, I’d pick out a small patch of the image vs the rest of the image. And then I’d predict the distribution of the small region from the everything else. The logic was that one could then invert the process, by taking a patch and predicting the distribution that this patch came from; this would tell you the features of the patch that correlate with far-away information.
More formally, my idea was to have two neural networks, which I called F and G, where F takes an image and outputs a feature vector, and G takes a feature vector and somehow functions as a generative model (in my original proposal I chose G to Image-GPT, but you could also imagine training G as a GAN or something). Let x be an image, C(x) be a small region from that image, and let M(x) be most of x with C(x) masked away, and let P(y|G(f)) be the probability of y according to G conditioned on f.
If we then consider the expression P(C(x)|G(F(M(x)))), then this essentially amounts to, how well is a region of an image predicted by features “far away”. So you can optimize this to get a model that does the sort of thing that your post talks about.
My idea was then that F might be kinda messy because id there’s a lot of redundant information, F doesn’t need to capture all copies of it, and so we probably shouldn’t use F. But since G is a generative model, it is incentivized to put in the redundant information to each of its output variables where the information would occur. So we would expect G to be quite robust.
If we wanted to abstract the important faraway features of an image while leaving behind the unimportant noise, we could thus just invert G, essentially asking “what faraway-relevant features might account for this patch?”.
I wrote more discussion about it in the #general channel of the EleutherAI discord back the 12th-13th of March, 2021.
Looks like all the followups to BEiT masked autoencoder image modeling are doing similar things: https://arxiv.org/abs/2202.03382 https://arxiv.org/abs/2202.03026 https://arxiv.org/abs/2202.04200
Summary
In this post, John starts with a very basic intuition: that abstractions are things you can get from many places in the world, which are therefore very redundant. Thus, for finding abstractions, you should first define redundant information: Concretely, for a system of n random variables X1, …, Xn, he defines the redundant information as that information that remains about the original after repeatedly resampling one variable at a time while keeping all the others fixed. Since there will not be any remaining information if n is finite, there is also the somewhat vague assumption that the number of variables goes to infinity in that resampling process.
The first main theorem says that this resampling process will not break the graphical structure of the original variables, i.e., if X1, …, Xn form a Markov random field or Bayesian network with respect to a graph, then the resampled variables will as well, even when conditioning on the abstraction of them. John’s interpretation is that you will still be able to make inferences about the world in a local way even if you condition on your high-level understanding (i.e., the information preserved by the resampling process)
The second main theorem applies this to show that any abstraction F(X1, …, Xn) that contains all the information remaining from the resampling process will also contain all the abstract summaries from the telephone theorem for all the ways that X1, …, Xn (with n going to infinity) could be decomposed into infinitely many nested Markov blankets. This makes F a supposedly quite powerful abstraction.
Further Thoughts
It’s fairly unclear how exactly the resampling process should be defined. If n is finite and fixed, then John writes that no information will remain. If, however, n is infinite from the start, then we should (probably?) expect the mutual information between the original random variable and the end result to also often be infinite, which also means that we should not expect a small abstract summary F.
Leaving that aside, it is in general not clear to me how F is obtained. The second theorem just assumes F and deduces that it contains the information from the abstract summaries of all telephone theorems. The hope is that F is low-dimensional and thus manageable. But no attempt is made to show the existence of a low-dimensional F in any realistic setting.
Another remark: I don’t quite understand what it means to resample one of the variables “in the physical world”. My understanding is as follows, and if anyone can correct it, that would be helpful: We have some “prior understanding” (= prior probability) about how the world works, and by measuring aspects in the world — e.g., patches full of atoms in a gear — we gain “data” from that prior probability distribution. When forgetting the data of one of the patches, we can look at the others and then use our physical understanding to predict the values for the lost patch. We then sample from that prediction.
Is that it? If so, then this resampling process seems very observer-dependent since there is probably no actual randomness in the universe. But if it is observer-dependent, then the resulting abstractions would also be observer-dependent, which seems to undermine the hope to obtain natural abstractions.
I also have a similar concern about the pencils example: if you have a prior on variables X1, …, Xn and you know that all of them will end up to be “objects of the same type”, and a joint sample of them gives you n pencils, then it makes sense to me that resampling them one by one until infinity will still give you a bunch of pencil-like objects, leading you to conclude that the underlying preserved information is a graphite core inside wood. However, where do the variables X1, …, Xn come from in the first place? Each Xi is already a high-level object and it is unclear to me what the analysis would look like if one reparameterized that space. (Thanks to Erik Jenner for mentioning that thought to me; there is a chance that I misrepresent his thinking, though.)
This might not work depending on the details of how “information” is specified in these examples, but would this model of abstractions consider “blob of random noise” a good abstraction?
On the one hand, different blobs of random noise contain no information about each other on a particle level—in fact, they contain no information about anything on a particle level, if the noise is “truly” random. And yet they seem like a natural category, since they have “higher-level properties” in common, such as unpredictability and idk maybe mean/sd of particle velocities or something.
This is basically my attempt to produce an illustrative example for my worry that mutual information might not be sufficient to capture the relationships between abstractions that make them good abstractions, such as “usefulness” or other higher-level properties.
If they have mean/sd in common (as in e.g. a Gaussian clustering problem), then the mean/sd are exactly the abstract information. If they’re all completely independent, without any latents (like mean/sd) at all, then the blob itself is not a natural abstraction, at least if we’re staying within an information-theoretic playground.
I do expect this will eventually need to be extended beyond mutual information, especially to handle the kinds of abstractions we use in math (like “groups”, for instance). My guess is that most of the structure will carry over; Bayes nets and mutual information have pretty natural category-theoretic extensions as I understand it, and I expect that roughly the same approach and techniques I use here will extend to that setting. I don’t personally have enough expertise there to do it myself, though.
I can’t really tell what distribution(s) you’re talking about here. You describe G_1 and G_2 as two random humans; wouldn’t these then just be two draws from the same distribution of all human genomes? If not, what distributions are you talking about? Certainly, for a single human, there’s only one genome, not a distribution (presuming you’re not talking about chimerism or whatever). Are you trying to describe parameterized distribution of human genomes, where you have a prior over the parameter values and where you draw repeatedly from the genome distribution, updating your prior over the parameters?