First bullet is correct, second bullet is close but not quite right. Just one sample of a natural latent will give you (approximately) all the mutual information between the X’s, and can give you some additional “noise” as well.
E.g. in the biased die example with many rolls, we can sample the bias given the rolls. Because that distribution is very pointy, the sample will be very close to the “true bias”, so that one sample will capture approximately all of the mutual information between the rolls.
(Note: I did skip a subtle step there. Our natural latents need a stronger condition than just “close to the true bias” in this example, since the low-order bits of the latent could in principle contain a bunch of relevant information which the true bias doesn’t; that would mess everything up. And indeed, that is exactly what would happen if we tried to use e.g. the empirical frequencies rather than a sample from P[bias | X]: given all but one die roll and the empirical frequencies calculated from all of the die rolls, we could exactly calculate the outcome of the remaining die roll. That’s why we do the sampling thing; the little bit of noise introduced by sampling is load-bearing, since it wipes out info in those low-order bits.
… but that’s a subtlety which you should not worry about until after the main picture makes sense conceptually.)
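To make the die example concrete, here is a minimal Python sketch of both points. It is an illustration under assumptions that are mine rather than anything from the discussion above: a uniform Dirichlet prior over the bias, a made-up six-sided bias vector, 20,000 rolls, and hypothetical helper names. It draws one sample from P[bias | X] and measures how far it lands from the true bias, then shows that the exact counts of all rolls (equivalent to the empirical frequencies), combined with all but one roll, pin down the remaining roll exactly.

```python
import random
from collections import Counter

def roll_die(bias, n, rng):
    """Draw n rolls (faces 0..len(bias)-1) from a die with the given face probabilities."""
    return rng.choices(range(len(bias)), weights=bias, k=n)

def sample_posterior_bias(rolls, rng, n_faces=6, prior=1.0):
    """Draw one sample from P[bias | rolls], ASSUMING a symmetric Dirichlet(prior) prior.

    A Dirichlet sample is obtained by normalizing independent Gamma draws.
    """
    counts = Counter(rolls)
    gammas = [rng.gammavariate(prior + counts[f], 1.0) for f in range(n_faces)]
    total = sum(gammas)
    return [g / total for g in gammas]

def recover_held_out_roll(other_rolls, full_counts):
    """Recover the one missing roll from the other rolls plus counts over ALL rolls."""
    missing = Counter(full_counts)
    missing.subtract(Counter(other_rolls))
    # Exactly one face is left with a count of 1: the held-out roll.
    (face,) = [f for f, c in missing.items() if c == 1]
    return face

rng = random.Random(0)
true_bias = [0.05, 0.10, 0.15, 0.20, 0.20, 0.30]  # hypothetical example values
rolls = roll_die(true_bias, 20_000, rng)

# One posterior sample: pointy posterior, so it lands near the true bias.
sampled = sample_posterior_bias(rolls, rng)
max_error = max(abs(s - t) for s, t in zip(sampled, true_bias))
print("largest per-face gap between sample and true bias:", max_error)

# Exact counts leak the held-out roll through their low-order bits.
full_counts = Counter(rolls)
held_out, others = rolls[-1], rolls[:-1]
leaked = recover_held_out_roll(others, full_counts)
print("held-out roll recovered from counts:", leaked == held_out)
```

The posterior sample carries essentially the same information about the bias as the counts do at the scale that matters, but its sampling noise is much larger than the 1/n change a single roll makes to the frequencies, so it cannot be used to back out any individual roll.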