The Lightcone Theorem says: conditional on X0, any sets of variables in X which are a distance of at least 2T apart in the graphical model are independent.
I am confused. This sounds to me like:
If you have sets of variables that start with no mutual information (conditioning on X0), and they are so far away that nothing other than X0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).
Some things that I am confused about as a result:
I don’t see why you are surprised, or why you would have said it wouldn’t work for finite T. (It seems obviously true to me from the statement, which makes me think I’m missing some subtlety.)
I don’t understand why the distribution of X0 must be the same as the distribution of X. It seems like it should hold for arbitrary X0.
I don’t see why this is relevant for natural abstractions. To me, the interesting part about abstractions is that it is generally fine to keep track of a small amount of information, even though there is tons and tons of information that “could have” been relevant (and does affect outcomes but in a way that is “noise” rather than “signal”). But this theorem is only telling you that you can throw away information that could never possibly have been relevant.
Yup, that’s basically it. And I agree that it’s pretty obvious once you see it—the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn’t know that was what I should look for? Much less obvious.
It does hold for arbitrary X0, but then XT doesn't have the same distribution as the original graphical model (unless we're running the sampler long enough to equilibrate). So we can't view X0 as a latent generating that distribution.
Not quite—note that the resampler itself throws away a ton of information about X0 while going from X0 to XT. And that is indeed information which “could have” been relevant, but almost always gets wiped out by noise. That’s the information we’re looking to throw away, for abstraction purposes.
So the reason this is interesting (for the thing you’re pointing to) is not that it lets us ignore information from far-away parts of XT which could not possibly have been relevant given X0, but rather that we want to further throw away information from X0 itself (while still maintaining conditional independence at a distance).
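To make the disjoint-lightcones argument concrete, here is a minimal sketch (my own illustration, not code from the post; it uses a 1-D nearest-neighbour chain with synchronous update rounds, which differs from the post's resampler in its details, and all names are just for illustration). Tracking which fresh-noise variables each site's final value depends on shows that any two sites at distance at least 2T share none of them, so given X0 they are independent.

```python
# Sketch: a 1-D chain of N sites, resampled for T synchronous rounds. Site i at
# round t is some function of its neighbours' round-(t-1) values plus fresh noise
# u[t][i], so given X0 its final value is a deterministic function of the noise
# inside its radius-T lightcone. We track those dependency sets and check that
# sites at distance >= 2T share no noise variables.

N, T = 40, 5

# deps[i] = set of noise variables (round, site) that site i's current value depends on
deps = [set() for _ in range(N)]
for t in range(1, T + 1):
    new_deps = []
    for i in range(N):
        d = {(t, i)}                      # fresh noise used to resample site i at round t
        for j in (i - 1, i + 1):          # only nearest neighbours enter the full conditional
            if 0 <= j < N:
                d |= deps[j]
        new_deps.append(d)
    deps = new_deps

for i in range(N):
    for j in range(i + 2 * T, N):         # every pair at distance >= 2T
        assert deps[i].isdisjoint(deps[j]), (i, j)
print("all pairs at distance >= 2T share no noise, so they are independent given X0")
```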
… I feel compelled to note that I’d pointed out a very similar thing a while ago.
Granted, that’s not exactly the same formulation, and the devil’s in the details.
Okay, that mostly makes sense.
I agree this is true, but why does the Lightcone theorem matter for it?
It is also a theorem that a Gibbs resampler initialized at equilibrium will produce XT distributed according to X, and as you say it's clear that the resampler throws away a ton of information about X0 in computing it. Why not use that theorem as the basis for identifying the information to throw away? In other words, why not throw away information from X0 while maintaining XT ∼ X?
EDIT: Actually, conditioned on X0, it is not the case that XT is distributed according to X.
(Simple counterexample: Take a graphical model where node A can be 0 or 1 with equal probability, and A causes B through a chain of > 2T steps, such that we always have B = A for a true sample from X. In such a setting, for a true sample from X, B should be equally likely to be 0 or 1, but conditional on X0 we have BT = B0, i.e. it is deterministic.)
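A quick simulation of this counterexample (my own code, using an equality-constraint copy chain as the A-to-B mechanism; the names and parameters are illustrative): under the true joint, B is a fair coin, but the Gibbs resampler can never move off X0, so conditional on X0 the value BT stays pinned to B0.

```python
import random

# A deterministic "copy chain" A = X[0] -> ... -> X[L-1] = B in which every factor
# forces equality, with the chain longer than 2T. Every full conditional
# P(X[i] | rest) is a point mass on the neighbours' shared value, so a Gibbs
# resampler started from a sample X0 never flips anything.

L, T = 20, 5                               # chain length > 2*T
random.seed(0)

def gibbs_round(x):
    """One round of single-site resampling on the copy chain (all updates are forced)."""
    x = list(x)
    for i in random.sample(range(L), L):   # visit sites in a random order
        if i == 0:
            x[i] = x[1]
        elif i == L - 1:
            x[i] = x[i - 1]
        else:
            x[i] = x[i - 1]                # both neighbours agree, so this is the full conditional
    return x

b_marginal, b_pinned = [], []
for _ in range(1000):
    a = random.randint(0, 1)               # a true sample from P[X]: all sites equal to A
    x = [a] * L
    b_marginal.append(x[-1])
    for _ in range(T):
        x = gibbs_round(x)
    b_pinned.append(x[-1] == a)            # is B_T still equal to B_0 (= A)?

print(f"P(B=1) under the true joint ~ {sum(b_marginal)/1000:.2f}")       # about 0.5
print(f"fraction of runs with B_T = B_0: {sum(b_pinned)/1000:.2f}")      # 1.00
```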
Of course, this is a problem for both my proposal and for the Lightcone theorem—in either case you can’t view X0 as a latent that generates X (which seems to be the main motivation, though I’m still not quite sure why that’s the motivation).
Sounds like we need to unpack what “viewing X0 as a latent which generates X” is supposed to mean.
I start with a distribution P[X]. Let’s say X is a bunch of rolls of a biased die, of unknown bias. But I don’t know that’s what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow “recover” the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X] = ∑Λ (∏i P[Xi|Λ]) P[Λ], where Λ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with Λ, rather than all the other rolls, which is useful insofar as Λ is much smaller than all the rolls.
Note that P[X|Λ] is not supposed to match P[X]; if it did, the representation would be useless. It’s the marginal ∑Λ (∏i P[Xi|Λ]) P[Λ] which is supposed to match P[X].
The lightcone theorem lets us do something similar. Rather than all the Xi's being independent given Λ, only those Xi's sufficiently far apart are independent, but the concept is otherwise similar. We express P[X] as ∑X0 P[X|X0] P[X0] (or, really, ∑Λ P[X|Λ] P[Λ], where Λ summarizes the info in X0 relevant to X, which is hopefully much smaller than all of X).
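A tiny numeric version of the die-roll story (my own illustration; it uses a biased coin and three flips to keep the state space small, and the function names are just placeholders): the joint P[X] is exactly the marginal over Λ, the flips are independent given Λ but correlated marginally, and that marginal correlation is the information the small latent Λ captures.

```python
from itertools import product

# Λ is the unknown bias, equally likely to be 0.2 or 0.8, and the flips are
# independent *given* Λ. The joint P[X] is the marginal sum_Λ (prod_i P[Xi|Λ]) P[Λ].

p_lam = {0.2: 0.5, 0.8: 0.5}                       # P[Λ]
n = 3                                              # three flips

def p_x_given_lam(x, lam):
    """P[X | Λ] = prod_i P[Xi | Λ]: flips are i.i.d. given the bias."""
    p = 1.0
    for xi in x:
        p *= lam if xi == 1 else 1 - lam
    return p

def p_joint(x):
    """P[X] = sum_Λ P[X|Λ] P[Λ]."""
    return sum(p_x_given_lam(x, lam) * w for lam, w in p_lam.items())

outcomes = list(product([0, 1], repeat=n))
assert abs(sum(p_joint(x) for x in outcomes) - 1.0) < 1e-12   # it really is a distribution

p1 = sum(p_joint(x) for x in outcomes if x[0] == 1)           # P[X1 = 1]
p2 = sum(p_joint(x) for x in outcomes if x[1] == 1)           # P[X2 = 1]
p12 = sum(p_joint(x) for x in outcomes if x[0] == 1 and x[1] == 1)

print(f"marginally:  P[X1=1,X2=1] = {p12:.3f}  vs  P[X1=1]P[X2=1] = {p1*p2:.3f}")
for lam in p_lam:
    print(f"given Λ={lam}: P[X1=1,X2=1|Λ] = {lam*lam:.3f} = P[X1=1|Λ]·P[X2=1|Λ]")
```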
Okay, I understand how that addresses my edit.
I’m still not quite sure why the lightcone theorem is a “foundation” for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don’t really have any concrete questions at the moment.
My impression is that it being a concrete example is the why. “What is the right framework to use?” and “what is the environment-structure in which natural abstractions can be defined?” are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer.
The fact that it loops in the speed of causal influence is also suggestive: it seems fundamental to the structure of our universe and crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.