Edit: This is now obsolete with our NAH distillation.
Making the Telephone Theorem and Its Proof Precise
This short form distills the Telephone Theorem and its proof. It makes no attempt to be “intuitive”; the only goal is to be mathematically precise at every step.
Let M0, M1, … be jointly distributed finite random variables, meaning they are all functions
Mi: Ω → Mi
starting from the same finite sample space Ω with a given probability distribution P and into respective finite value spaces Mi. Additionally, assume that these random variables form a Markov chain M0→M1→⋯.
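To make the setup concrete, here is a minimal sketch in Python (the specific chain is my own illustrative toy example, not from the post): random variables are represented as functions on a finite Ω, and the Markov property of M0→M1→M2 is checked directly.

```python
# Toy instance of the setup: Omega = {0,1,2,3} with uniform P,
# M0(w) = w, M1(w) = w mod 2, M2(w) = w mod 2, giving M0 -> M1 -> M2.
omega = [0, 1, 2, 3]
P = {w: 0.25 for w in omega}
M0 = lambda w: w
M1 = lambda w: w % 2
M2 = lambda w: w % 2

def joint(P, *variables):
    """Joint distribution of random variables, i.e. of functions on Omega."""
    d = {}
    for w, p in P.items():
        key = tuple(X(w) for X in variables)
        d[key] = d.get(key, 0.0) + p
    return d

# Markov property: P(m0,m1,m2) * P(m1) == P(m0,m1) * P(m1,m2) everywhere.
p012 = joint(P, M0, M1, M2)
p01, p12, p1 = joint(P, M0, M1), joint(P, M1, M2), joint(P, M1)
is_markov = all(
    abs(p * p1[(m1,)] - p01[(m0, m1)] * p12[(m1, m2)]) < 1e-12
    for (m0, m1, m2), p in p012.items()
)
```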
Lemma: For a Markov chain M0→M1→M2, the following two statements are equivalent:
(a) I(M1;M0)=I(M2;M0)
(b) For all m1,m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2).
Proof:
Assume (a): Inspecting an information diagram of M0, M1, M2, we see that (a) also gives us the Markov chain M0→M2→M1. Markov chains can be reversed, so we get the two chains
M1→M2→M0, M2→M1→M0.
Factorizing along these two chains, we obtain:
P(M0∣M2)⋅P(M1,M2) = P(M0,M1,M2) = P(M0∣M1)⋅P(M1,M2)
and thus, after dividing by P(m1,m2), for all m1, m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2). That proves (b).
Assume (b): We have
P(M0,M1∣M2) = P(M0∣M1,M2)⋅P(M1∣M2) = P(M0∣M1)⋅P(M1∣M2) = P(M0∣M2)⋅P(M1∣M2),
where, in the second step, we used the Markov chain M0→M1→M2 and, in the third step, we used assumption (b). This conditional independence of M0 and M1 given M2 gives us the vanishing of conditional mutual information:
I(M0;M1∣M2) = 0.
Together with the Markov chain M0→M1→M2, this results, by inspecting an information diagram, in the equality I(M1;M0)=I(M2;M0). □
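The lemma can be sanity-checked numerically. The sketch below (toy chains of my own construction, not from the post) compares a chain where M2 retains everything M1 knows about M0 against one where an independent coin erases M1 half the time; condition (a) holds only in the first case, and condition (b) fails in the second since P(M0 ∣ M1=0) ≠ P(M0 ∣ M2='e').

```python
import math

# M0 uniform on {0,1,2,3}, M1 = M0 mod 2, and an independent coin C in w[1].
omega = [(m0, c) for m0 in range(4) for c in (0, 1)]
P = {w: 1 / 8 for w in omega}
M0 = lambda w: w[0]
M1 = lambda w: w[0] % 2
M2a = lambda w: w[0] % 2                         # chain A: M2 = M1
M2b = lambda w: w[0] % 2 if w[1] == 0 else 'e'   # chain B: erased half the time

def mutual_information(P, X, Y):
    """I(X;Y) in bits, for random variables given as functions on Omega."""
    pxy, px, py = {}, {}, {}
    for w, p in P.items():
        x, y = X(w), Y(w)
        pxy[(x, y)] = pxy.get((x, y), 0.0) + p
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

I_10 = mutual_information(P, M1, M0)    # 1 bit
I_20a = mutual_information(P, M2a, M0)  # 1 bit: (a) holds for chain A
I_20b = mutual_information(P, M2b, M0)  # 0.5 bit: (a) fails for chain B
```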
Theorem: Let n≥1. The following are equivalent:
(a) I(Mn;M0)=I(Mn+1;M0)
(b) There are functions fn, fn+1 defined on Mn, Mn+1, respectively, such that:
1. fn(Mn)=fn+1(Mn+1) with probability 1, i.e., the set of all ω∈Ω for which the equality does not hold has measure zero.
2. For all mn∈Mn, we have the equality P(M0∣Mn=mn)=P(M0∣fn(Mn)=fn(mn)), and the same for n+1.
Proof: The Markov chain M0→M1→⋯ immediately also gives us the Markov chain M0→Mn→Mn+1, so we may assume n=1 without loss of generality. Let us therefore consider the simple Markov chain M0→M1→M2.
Assume (a): By the lemma, this gives us for all m1,m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2).
Define the two functions fi: Mi→Δ(M0), i=1,2 by:
fi(mi) := P(M0∣mi).
Then we have f1(M1)=f2(M2) with probability 1[1], giving us the first condition we wanted to prove.
For the second condition, we use a trick from Probability as Minimal Map: set p:=f1(m1), which is a probability distribution. We get
P(M0∣f1(M1)=f1(m1)) = P(M0∣f1(M1)=p)
= P(M0, f1(M1)=p) / P(f1(M1)=p)
= [∑_{m1′: f1(m1′)=p} P(M0, m1′)] / [∑_{m1′: f1(m1′)=p} P(m1′)]
= [∑_{m1′: f1(m1′)=p} P(M0∣m1′)⋅P(m1′)] / [∑_{m1′: f1(m1′)=p} P(m1′)]
= [∑_{m1′: f1(m1′)=p} p⋅P(m1′)] / [∑_{m1′: f1(m1′)=p} P(m1′)]
= p = f1(m1) = P(M0∣M1=m1).
That proves (b).
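This chain of equalities can be checked numerically. The sketch below (my own toy example, not from the post) takes M0 uniform on {0,1,2,3} and M1 = (M0 mod 2, C) with an independent coin C, so that f1(m1) := P(M0 ∣ m1) genuinely coarse-grains M1, and verifies P(M0 ∣ f1(M1)=p) = p for every attained value p.

```python
# Sample space: pairs (m0, c) with an independent coin c, uniform P.
omega = [(m0, c) for m0 in range(4) for c in (0, 1)]
P = {w: 1 / 8 for w in omega}
M0 = lambda w: w[0]
M1 = lambda w: (w[0] % 2, w[1])

def conditional(P, X, event):
    """P(X | event), with event a predicate on Omega."""
    z = sum(p for w, p in P.items() if event(w))
    d = {}
    for w, p in P.items():
        if event(w):
            d[X(w)] = d.get(X(w), 0.0) + p
    return {x: round(v / z, 12) for x, v in d.items()}

def f1(m1):
    """f1(m1) := P(M0 | M1 = m1), as a hashable sorted tuple of pairs."""
    return tuple(sorted(conditional(P, M0, lambda w: M1(w) == m1).items()))

for m1 in {M1(w) for w in omega}:
    p = f1(m1)
    # P(M0 | f1(M1) = p) equals p itself, as in the derivation above.
    cond = conditional(P, M0, lambda w: f1(M1(w)) == p)
    assert tuple(sorted(cond.items())) == p
```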
Assume (b): For the other direction, let m1, m2 be given with P(m1,m2)>0. Pick ω∈Ω with (M1(ω),M2(ω))=(m1,m2) and P(ω)>0; such an ω exists since P(m1,m2)>0, and since P(ω)>0, the almost-sure equality f1(M1)=f2(M2) holds at ω. We obtain
f1(m1) = [f1(M1)](ω) = [f2(M2)](ω) = f2(m2)
and thus
P(M0∣m1) = P(M0∣f1(M1)=f1(m1)) = P(M0∣f2(M2)=f2(m2)) = P(M0∣m2).
The result follows from the Lemma.[2]
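The direction (b) ⇒ (a) can likewise be checked numerically. In this sketch (again a toy example of my own, not from the post), M1 and M2 each consist of the parity of M0 plus an independent coin, f1 = f2 = “extract the parity bit” satisfies both conditions of (b), and the two mutual informations come out equal.

```python
import math

# M0 uniform on {0,1,2,3}; M1 = (M0 mod 2, C1), M2 = (M0 mod 2, C2)
# with C1, C2 independent coins, so M0 -> M1 -> M2 is a Markov chain.
omega = [(m0, c1, c2) for m0 in range(4) for c1 in (0, 1) for c2 in (0, 1)]
P = {w: 1 / 16 for w in omega}
M0 = lambda w: w[0]
M1 = lambda w: (w[0] % 2, w[1])
M2 = lambda w: (w[0] % 2, w[2])
f = lambda m: m[0]  # candidate f1 = f2: keep only the parity bit

def mutual_information(P, X, Y):
    """I(X;Y) in bits, for random variables given as functions on Omega."""
    pxy, px, py = {}, {}, {}
    for w, p in P.items():
        x, y = X(w), Y(w)
        pxy[(x, y)] = pxy.get((x, y), 0.0) + p
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# First condition of (b): f(M1) = f(M2) with probability 1 (here: everywhere).
# The second condition holds because the coins are independent of M0.
assert all(f(M1(w)) == f(M2(w)) for w in omega)
I_10 = mutual_information(P, M1, M0)
I_20 = mutual_information(P, M2, M0)
# As (a) predicts, both equal I(f(M1); M0) = 1 bit.
```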
[1] Somehow, my brain didn’t find this obvious. Here is an explanation, using that f1(m1)=f2(m2) whenever P(m1,m2)>0:
P({ω ∣ f1(M1(ω)) ≠ f2(M2(ω))}) ≤ P({ω ∣ P(M1(ω),M2(ω)) = 0})
= P((M1,M2)⁻¹({(m1,m2) ∣ P(m1,m2)=0}))
= ∑_{m1,m2: P(m1,m2)=0} P((M1,M2)⁻¹(m1,m2))
= ∑_{m1,m2: P(m1,m2)=0} P(m1,m2)
= 0.
[2] There is some subtlety about whether the random variable f1(M1) can be replaced by f2(M2) in that equation. But given that they are “almost” the same random variables, I think this is valid inside the probability equation.