Summary
In this post, John starts with a very basic intuition: abstractions are things you can learn from many places in the world, and are therefore highly redundant. Thus, to find abstractions, you should first define redundant information. Concretely, for a system of n random variables X1, …, Xn, he defines the redundant information as the information about the original joint sample that remains after repeatedly resampling one variable at a time, conditional on all the others. Since no information survives this process if n is finite, there is also the somewhat vague assumption that the number of variables goes to infinity as the resampling proceeds.
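As a concrete (if informal) picture of this resampling process, here is a minimal sketch in Python. Everything in it is my illustration rather than code from the post; in particular, `sample_conditional` is a hypothetical stand-in for whatever would let us draw Xi from its conditional distribution given all the other variables.

```python
import numpy as np

def resample(x, sample_conditional, n_steps, rng):
    """Repeatedly redraw one coordinate of x from its conditional
    distribution given all the others (a Gibbs-style sweep).

    sample_conditional(i, x, rng) is assumed to draw X_i from
    P(X_i | X_{-i}); it encodes the joint distribution of the system.
    """
    x = x.copy()
    n = len(x)
    for _ in range(n_steps):
        i = rng.integers(n)                   # pick one variable at random
        x[i] = sample_conditional(i, x, rng)  # keep all the others fixed
    return x

# Tiny demo with a stand-in joint distribution, purely for illustration:
# a Gaussian Markov random field on a cycle, where each variable's
# conditional mean is 0.4 times the sum of its two neighbours.
def sample_conditional(i, x, rng):
    n = len(x)
    neighbour_sum = x[(i - 1) % n] + x[(i + 1) % n]
    return rng.normal(0.4 * neighbour_sum, 1.0)

rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
x_end = resample(x0, sample_conditional, n_steps=100_000, rng=rng)
```

Whatever information about the initial joint sample still survives after many such steps is, informally, the redundant information. For fixed finite n a chain like this will typically mix and forget its initialization entirely, which is exactly why the assumption that n goes to infinity is needed.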
The first main theorem says that this resampling process does not break the graphical structure of the original variables: if X1, …, Xn form a Markov random field or Bayesian network with respect to a graph, then the resampled variables do as well, even when conditioning on the abstraction computed from them. John’s interpretation is that you will still be able to make inferences about the world in a local way even after conditioning on your high-level understanding (i.e., on the information preserved by the resampling process).
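To make “does not break the graphical structure” concrete in the Markov-random-field case, here is my paraphrase in standard factorization notation (not necessarily John’s exact formulation): if the original distribution factors into nonnegative potentials over the cliques of a graph G, then the distribution of the resampled variables X′1, …, X′n, even conditioned on the abstraction F, still factors over the same cliques,

\[
P\big(X'_1, \dots, X'_n \mid F\big) \;=\; \prod_{c \,\in\, \mathrm{cliques}(G)} \phi_c\big(X'_c\big),
\]

so the conditional independencies encoded by G survive both the resampling and the conditioning.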
The second main theorem applies this to show that any abstraction F(X1, …, Xn) containing all the information that survives the resampling process also contains the abstract summaries from the telephone theorem, for every way that X1, …, Xn (with n going to infinity) can be decomposed into infinitely many nested Markov blankets. This supposedly makes F a quite powerful abstraction.
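For context, the telephone theorem (stated roughly and in my notation) says: for a Markov chain M0 → M1 → M2 → … — for instance, a sequence of nested Markov blankets — the mutual information I(M0; Mt) converges as t → ∞, and in the limit the surviving information is carried by summaries ft(Mt) that are perfectly conserved along the chain,

\[
f_{t+1}(M_{t+1}) = f_t(M_t) \;\text{ with probability } 1,
\qquad
I(M_0; M_t) \to I\big(M_0; f_t(M_t)\big).
\]

The second theorem then says that F contains every such summary ft, for every admissible decomposition into blankets.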
Further Thoughts
It’s fairly unclear how exactly the resampling process should be defined. If n is finite and fixed, then, as John writes, no information will remain. If, however, n is infinite from the start, then we should (probably?) expect the mutual information between the original random variables and the end result of the resampling to often be infinite as well, which in turn means we should not expect a small abstract summary F.
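To see why infinitely many variables can easily carry infinite mutual information about something, here is a toy Gaussian calculation (my own example, not from the post): if each Xi is a noisy copy of a shared quantity Z, the information that X1, …, Xn jointly carry about Z grows without bound in n.

```python
import numpy as np

# Toy example (mine, not from the post): Z ~ N(0, 1) and
# X_i = Z + e_i with independent noise e_i ~ N(0, s2).
# Then I(Z; X_1..X_n) = 0.5 * ln(1 + n / s2) nats, since the
# sample mean of the X_i is a sufficient statistic for Z with
# residual noise variance s2 / n. This diverges as n -> infinity.
s2 = 4.0
for n in [1, 10, 100, 10_000, 1_000_000]:
    mi = 0.5 * np.log(1 + n / s2)
    print(f"n = {n:>9}: I(Z; X_1..X_n) = {mi:.2f} nats")
```

If the redundant information in the n → ∞ limit behaves like this, then without further assumptions F would have to encode infinitely many bits and could not be a small summary.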
Leaving that aside, it is in general not clear to me how F is obtained. The second theorem simply assumes such an F and deduces that it contains the information from the abstract summaries of all telephone theorems. The hope is that F is low-dimensional and thus manageable, but no attempt is made to show the existence of a low-dimensional F in any realistic setting.
Another remark: I don’t quite understand what it means to resample one of the variables “in the physical world”. My understanding is as follows, and if anyone can correct it, that would be helpful: we have some “prior understanding” (= a prior probability distribution) of how the world works, and by measuring parts of the world — e.g., patches full of atoms in a gear — we obtain “data” drawn from that distribution. When we forget the data of one of the patches, we can look at the others and use our physical understanding to predict the value of the lost patch. We then sample from that prediction.
Is that it? If so, then this resampling process seems very observer-dependent, since there is probably no actual randomness in the universe. But if it is observer-dependent, then the resulting abstractions would be observer-dependent too, which seems to undermine the hope of obtaining natural abstractions.
I also have a similar concern about the pencils example: if you have a prior on variables X1, …, Xn, you know that all of them will turn out to be “objects of the same type”, and a joint sample of them gives you n pencils, then it makes sense to me that resampling them one at a time indefinitely will still give you a bunch of pencil-like objects, leading you to conclude that the underlying preserved information is something like “a graphite core inside wood”. However, where do the variables X1, …, Xn come from in the first place? Each Xi is already a high-level object, and it is unclear to me what the analysis would look like if one reparameterized that space. (Thanks to Erik Jenner for mentioning that thought to me; there is a chance that I misrepresent his thinking, though.)