I’ve been doing a deep dive on this post, and while the main theorems make sense I find myself quite confused about some basic concepts. I would really appreciate some help here!
So ‘latents’ are defined by their conditional distribution functions, whose shape is implicit in the factorization that the latents need to satisfy, meaning they don’t always have to look like P[Λ|X]; they can look like P[Λ], P[X|Λ], etc., right?
I don’t get the ‘standard form’ business. It seems like a procedure to turn one latent variable Λ into another relative to X? I don’t get what the notation Λ∗=(x↦P[X=x∣Λ]) means—does it mean that it takes a Λ defined by some conditional distribution function like P[Λ], P[X|Λ], or P[Λ|X] and converts it into P[X|Λ]? That doesn’t seem right; the notation looks more like a likelihood function than a conditional distribution. But then what conditional distribution defines this latent Λ∗?
The Resampling stuff is a bit confusing too:
if we have a natural latent Λ, then construct a new natural latent by resampling Λ conditional on X (i.e. sample from P[Λ|X]), independently of whatever other stuff Λ′ we’re interested in.
I don’t know what operation is being performed here—what CPDs go in, and what CPDs come out.
“construct a new natural latent by resampling Λ conditional on X (i.e. sample from P[Λ|X]), independently of whatever other stuff Λ′ we’re interested in.” Isn’t this what we are already doing when stating a diagram like X1←Λ→X2, which implies a factorization P[Λ]P[X1|Λ]P[X2|Λ], none of which involves Λ′? What changes when resampling? Aaaaahhh, I think I’m really confused here.
Also, does all this imply that we’re starting out assuming that Λ shares a probability space with all the other possible latents, e.g. P[X,Λ,Λ′,Λ′′,…]? How does this square with a latent variable being defined by the CPD implicit in the factorization?
And finally:
In standard form, a natural latent is always approximately a deterministic function of X. Specifically: Λ(X)≈∏i(x′↦P[Xi=x′i|X¯i]).
...
Suppose there exists an approximate natural latent over X1,…,Xn. Construct a new random variable X′, sampled from the distribution (x′↦∏iP[Xi=x′i|X¯i]). (In other words: simultaneously resample each Xi given all the others.) Conjecture: X′ is an approximate natural latent (though the approximation may not be the best possible). And if so, a key question is: how good is the approximation?
Where is the top result proved, and how is this statement different from the Universal Natural Latent Conjecture below? Also, is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?
So ‘latents’ are defined by their conditional distribution functions, whose shape is implicit in the factorization that the latents need to satisfy, meaning they don’t always have to look like P[Λ|X]; they can look like P[Λ], P[X|Λ], etc., right?
The key idea here is that, when “choosing a latent”, we’re not allowed to choose P[X]; P[X] is fixed/known/given, a latent is just a helpful tool for reasoning about or representing P[X]. So another way to phrase it is: we’re choosing our whole model P[X,Λ], but with a constraint on the marginal P[X]. P[Λ|X] then captures all of the degrees of freedom we have in choosing a latent.
Now, we won’t typically represent the latent explicitly as P[Λ|X]. Typically we’ll choose latents such that P[X,Λ] satisfies some factorization(s), and those factorizations will provide a more compact representation of the distribution than two giant tables for P[X], P[Λ|X]. For instance, insofar as P[Λ,X] factors as P[Λ]∏iP[Xi|Λ], we might want to represent the distribution as P[Λ] and {P[Xi|Λ]} (both for analytic and computational purposes).
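In case a concrete toy version helps, here’s a minimal numpy sketch of that picture. All variable names and numbers below are made up for illustration, and the toy P[X] is deliberately generated from a naive-Bayes model so that the factorization P[Λ]∏iP[Xi|Λ] holds exactly:

    import numpy as np

    # Toy example (all names and numbers made up): two binary observables
    # X = (X1, X2) and one binary latent L standing in for Λ. The toy P[X]
    # is generated from a naive-Bayes model so the factorization holds exactly.
    P_L_true     = np.array([0.5, 0.5])                    # P[Λ]
    P_X1_given_L = np.array([[0.9, 0.2], [0.1, 0.8]])      # P[X1|Λ], rows = x1, cols = λ
    P_X2_given_L = np.array([[0.8, 0.3], [0.2, 0.7]])      # P[X2|Λ]
    P_XL = (P_L_true[None, None, :]
            * P_X1_given_L[:, None, :]
            * P_X2_given_L[None, :, :])                    # joint P[X1,X2,Λ]

    # Now pretend we were only handed the fixed/given marginal P[X]:
    P_X = P_XL.sum(axis=2)

    # "Choosing a latent" = choosing P[Λ|X]; here we pick the one from the model above.
    P_L_given_X = P_XL / P_X[:, :, None]

    # That single choice pins down the whole joint P[X,Λ] = P[X] P[Λ|X],
    # and the X-marginal is the given P[X] by construction.
    assert np.allclose(P_X[:, :, None] * P_L_given_X, P_XL)

    # Because this joint factors as P[Λ] ∏_i P[Xi|Λ], the compact representation
    # (P[Λ] plus the per-component tables) carries the same information as the
    # two big tables P[X] and P[Λ|X]:
    P_L = P_XL.sum(axis=(0, 1))
    assert np.allclose(P_L[None, None, :]
                       * (P_XL.sum(axis=1) / P_L)[:, None, :]   # P[X1|Λ]
                       * (P_XL.sum(axis=0) / P_L)[None, :, :],  # P[X2|Λ]
                       P_XL)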
I don’t get the ‘standard form’ business.
We’ve largely moved away from using the standard form anyway; I recommend ignoring it at this point.
Also, is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?
Yup, that post proves the universal natural latent conjecture when strong redundancy holds (over 3 or more variables). Whether the conjecture still holds when strong redundancy fails is an open question. But since the strong redundancy result, we’ve mostly shifted toward viewing strong redundancy as the usual condition to look for, and have focused less on weak redundancy.
Resampling
Also, does all this imply that we’re starting out assuming that Λ shares a probability space with all the other possible latents, e.g. P[X,Λ,Λ′,Λ′′,…]? How does this square with a latent variable being defined by the CPD implicit in the factorization?
We conceptually start with the objects P[X], P[Λ|X], and P[Λ′|X]. (We’re imagining here that two different agents measure the same distribution P[X], but then they each model it using their own latents.) Given only those objects, the joint distribution P[X,Λ,Λ′] is underdefined—indeed, it’s unclear what such a joint distribution would even mean! Whose distribution is it?
One simple answer (unsure whether this will end up being the best way to think about it): one agent is trying to reason about the observables X, their own latent Λ, and the other agent’s latent Λ′ simultaneously, e.g. in order to predict whether the other agent’s latent is isomorphic to Λ (as would be useful for communication).
Since Λ and Λ′ are both latents, in order to define P[X,Λ,Λ′], the agent needs to specify P[Λ,Λ′|X]. That’s where the underdefinition comes in: only P[Λ|X] and P[Λ′|X] were specified up-front. So, we sidestep the problem: we construct a new latent Λ′′ such that P[Λ′′|X] matches P[Λ|X], but Λ′′ is independent of Λ′ given X. Then we’ve specified the whole distribution P[X,Λ′,Λ′′]=P[X]P[Λ′|X]P[Λ′′|X].
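If a concrete version of that construction helps, here’s a minimal numpy sketch (the specific tables and names are made up; only the structure matters):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup (made up for illustration): X takes 3 values; Λ and Λ' are binary
    # latents, each specified only through its CPD given X.
    P_X          = np.array([0.5, 0.3, 0.2])
    P_L_given_X  = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])   # P[Λ|X],  rows = x
    P_Lp_given_X = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])   # P[Λ'|X], rows = x

    # These three objects do NOT pin down P[X,Λ,Λ']; for that we'd need P[Λ,Λ'|X].
    # The resampling move: define Λ'' with the same CPD as Λ, but independent of Λ'
    # given X. Then the joint is fully specified:
    #     P[X,Λ',Λ''] = P[X] P[Λ'|X] P[Λ''|X]
    P_joint = (P_X[:, None, None]
               * P_Lp_given_X[:, :, None]
               * P_L_given_X[:, None, :])        # indexed (x, λ', λ'')

    # Sanity checks: (X,Λ'') is distributed exactly like (X,Λ) would be under
    # P[X]P[Λ|X], and Λ'' is independent of Λ' given X by construction.
    assert np.allclose(P_joint.sum(axis=1), P_X[:, None] * P_L_given_X)
    for x in range(3):
        cond = P_joint[x] / P_joint[x].sum()     # P[Λ',Λ''|X=x]
        assert np.allclose(cond, np.outer(P_Lp_given_X[x], P_L_given_X[x]))

    # The sampling view of the same thing: draw x ~ P[X], then draw Λ'' from
    # P[Λ|X=x] without looking at Λ' at all. That is "resampling Λ conditional
    # on X, independently of Λ'".
    x = rng.choice(3, p=P_X)
    lam_pp = rng.choice(2, p=P_L_given_X[x])

So the CPDs that go in are P[X], P[Λ|X], and P[Λ′|X], and what comes out is the fully specified joint over (X, Λ′, Λ′′).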
Hopefully that clarifies what the math is, at least. It’s still a bit fishy conceptually, and I’m not convinced it’s the best way to handle the part it’s trying to handle.
Thank you, that is very clarifying!