So, we’ve also got an analogue of KL-divergence for crisp infradistributions.
We’ll be using $P$ and $Q$ for crisp infradistributions, and $p$ and $q$ for probability distributions associated with them. $D_{KL}$ will be used for the KL-divergence of infradistributions, and $d_{KL}$ for the KL-divergence of probability distributions. For crisp infradistributions, the KL-divergence is defined as
$$D_{KL}(P|Q) := \max_{q \in Q} \min_{p \in P} d_{KL}(p|q)$$
I’m not entirely sure why it’s like this, but it has the basic properties you would expect of the KL-divergence, like joint convexity and interacting well with continuous pushforwards and semidirect products.
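As a concrete anchor for the definition, here's a minimal numeric sketch, representing each crisp infradistribution by a finite set of probability vectors over a finite space (a finite stand-in for the full credal set; the helper names `kl` and `infra_kl` are mine, not from the text):

```python
import math

def kl(p, q):
    """d_KL(p|q) for finite probability vectors; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def infra_kl(P, Q):
    """D_KL(P|Q) := max over q in Q of min over p in P of d_KL(p|q)."""
    return max(min(kl(p, q) for p in P) for q in Q)

P = [[0.3, 0.7], [0.5, 0.5]]
Q = [[0.4, 0.6], [0.5, 0.5]]
assert infra_kl(P, Q) >= 0   # nonnegativity, as in Proposition 1 below
assert infra_kl(Q, Q) == 0.0  # a set has divergence 0 from itself
```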
Straight off the bat, we have:
Proposition 1: $D_{KL}(P|Q) \ge 0$
Proof: The KL-divergence between probability distributions is always nonnegative, by Gibbs’ inequality, so any max-min of such divergences is nonnegative as well.
Proposition 2: $D_{KL}(P|Q) = 0 \leftrightarrow Q \subseteq P$
Proof: Since each inner min is nonnegative, the max is 0 iff every inner min is 0:
$$D_{KL}(P|Q) = 0 \leftrightarrow \max_{q \in Q} \min_{p \in P} d_{KL}(p|q) = 0 \leftrightarrow \forall q \in Q: \min_{p \in P} d_{KL}(p|q) = 0$$
And, since the min is attained,
$$\forall q \in Q: \min_{p \in P} d_{KL}(p|q) = 0 \leftrightarrow \forall q \in Q \, \exists p \in P: d_{KL}(p|q) = 0$$
And now, because KL-divergence between probability distributions is 0 only when they’re equal, we have:
$$\forall q \in Q \, \exists p \in P: d_{KL}(p|q) = 0 \leftrightarrow \forall q \in Q: q \in P \leftrightarrow Q \subseteq P$$
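A quick numeric sanity check of both directions, under the same finite-set representation (helper names are mine):

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def infra_kl(P, Q):
    return max(min(kl(p, q) for p in P) for q in Q)

P = [[0.3, 0.7], [0.5, 0.5]]
# Q contained in P: every q has some p (namely itself) at divergence 0
assert infra_kl(P, [[0.5, 0.5]]) == 0.0
# Q not contained in P: some q has no matching p, so the divergence is positive
assert infra_kl(P, [[0.9, 0.1]]) > 0.0
```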
Proposition 3: If $u$ is the uniform distribution on $X$, then $D_{KL}(P|u) = \ln(|X|) - H(P)$

Proof:
$$D_{KL}(P|u) = \min_{p \in P} d_{KL}(p|u) = \min_{p \in P}\left(H(p, u) - H(p)\right)$$
And the cross-entropy of any distribution with the uniform distribution is always $\ln(|X|)$, so:
$$= \ln(|X|) + \min_{p \in P}(-H(p)) = \ln(|X|) - \max_{p \in P} H(p) = \ln(|X|) - H(P)$$
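This identity is easy to check numerically; here $Q = \{u\}$ is a singleton, so the outer max is trivial (helper names mine):

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

P = [[0.2, 0.8], [0.35, 0.65]]  # a finite stand-in for a crisp infradistribution
u = [0.5, 0.5]                  # uniform distribution, |X| = 2
lhs = min(kl(p, u) for p in P)                  # D_KL(P|u); max over {u} is trivial
rhs = math.log(2) - max(entropy(p) for p in P)  # ln|X| - H(P)
assert abs(lhs - rhs) < 1e-12
```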
Proposition 4: $D_{KL}$ is a convex function over $(\square^{crisp}X)^2$.
Proof: Let’s use $a$ as our number in $[0,1]$ in order to talk about mixtures. Then,
$$D_{KL}(aP_1 + (1-a)P_2 \,|\, aQ_1 + (1-a)Q_2) = \max_{q \in aQ_1 + (1-a)Q_2} \; \min_{p \in aP_1 + (1-a)P_2} d_{KL}(p|q)$$
$$= \max_{q_1 \in Q_1, q_2 \in Q_2} \; \min_{p_1 \in P_1, p_2 \in P_2} d_{KL}(ap_1 + (1-a)p_2 \,|\, aq_1 + (1-a)q_2)$$
Then we apply joint convexity of the KL-divergence for probability distributions to get:
$$\le \max_{q_1 \in Q_1, q_2 \in Q_2} \; \min_{p_1 \in P_1, p_2 \in P_2} \left( a \, d_{KL}(p_1|q_1) + (1-a) \, d_{KL}(p_2|q_2) \right)$$
$$= \max_{q_1 \in Q_1, q_2 \in Q_2} \left( a \min_{p_1 \in P_1} d_{KL}(p_1|q_1) + (1-a) \min_{p_2 \in P_2} d_{KL}(p_2|q_2) \right)$$
$$= a \max_{q_1 \in Q_1} \min_{p_1 \in P_1} d_{KL}(p_1|q_1) + (1-a) \max_{q_2 \in Q_2} \min_{p_2 \in P_2} d_{KL}(p_2|q_2)$$
$$= a \, D_{KL}(P_1|Q_1) + (1-a) \, D_{KL}(P_2|Q_2)$$
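Numerically, the inequality checks out; the mixture of two credal sets is taken as the set of pointwise mixtures of their members (helper names mine):

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def infra_kl(P, Q):
    return max(min(kl(p, q) for p in P) for q in Q)

def mix(A, B, a):
    """The mixture a*A + (1-a)*B as a set: all pointwise mixtures of members."""
    return [[a * x + (1 - a) * y for x, y in zip(p1, p2)] for p1 in A for p2 in B]

P1, P2 = [[0.2, 0.8], [0.4, 0.6]], [[0.7, 0.3]]
Q1, Q2 = [[0.5, 0.5]], [[0.6, 0.4], [0.3, 0.7]]
a = 0.3
lhs = infra_kl(mix(P1, P2, a), mix(Q1, Q2, a))
rhs = a * infra_kl(P1, Q1) + (1 - a) * infra_kl(P2, Q2)
assert lhs <= rhs + 1e-12  # convexity: the mixture's divergence is below the mixture of divergences
```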
Proposition 5: $D_{KL}(P \ltimes K | Q \ltimes J) \ge D_{KL}(P|Q) + \min_{p \in P} \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))]$
Proof:
$$D_{KL}(P \ltimes K | Q \ltimes J) = \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} d_{KL}(p \ltimes k | q \ltimes j)$$
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \sum_{x,y} (p \ltimes k)(x,y) \cdot \ln\left(\frac{(p \ltimes k)(x,y)}{(q \ltimes j)(x,y)}\right)$$
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \sum_{x,y} p(x) \cdot k_x(y) \cdot \ln\left(\frac{p(x) \cdot k_x(y)}{q(x) \cdot j_x(y)}\right)$$
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \sum_{x,y} p(x) \cdot k_x(y) \cdot \left( \ln\left(\frac{p(x)}{q(x)}\right) + \ln\left(\frac{k_x(y)}{j_x(y)}\right) \right)$$
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \left( \sum_{x,y} p(x) \cdot k_x(y) \cdot \ln\left(\frac{p(x)}{q(x)}\right) + \sum_{x,y} p(x) \cdot k_x(y) \cdot \ln\left(\frac{k_x(y)}{j_x(y)}\right) \right)$$
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \left( \sum_x p(x) \cdot \ln\left(\frac{p(x)}{q(x)}\right) \cdot \left( \sum_y k_x(y) \right) + \sum_x p(x) \cdot \left( \sum_y k_x(y) \cdot \ln\left(\frac{k_x(y)}{j_x(y)}\right) \right) \right)$$
At this point we can abbreviate the KL-divergences, and observe that $\sum_y k_x(y) = 1$, to get:
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \left( d_{KL}(p|q) + \sum_x p(x) \cdot d_{KL}(k_x|j_x) \right)$$
And then pack up the expectation:
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P, k \in \prod_x K(x)} \left( d_{KL}(p|q) + \mathbb{E}_p[\lambda x. d_{KL}(k_x|j_x)] \right)$$
Then, with the choice of $q$ and $j$ fixed, the $k_x$ can be chosen independently for each $x$, so we can move that choice all the way inside, to get:
$$= \max_{q \in Q, j \in \prod_x J(x)} \; \min_{p \in P} \left( d_{KL}(p|q) + \mathbb{E}_p\left[\lambda x. \min_{k_x \in K(x)} d_{KL}(k_x|j_x)\right] \right)$$
Now, there’s something else we can notice. The optimal choice of $j$ doesn’t depend on which $p$ is selected: since $p(x) \ge 0$, the best move is to pick, for every $x$, the $j_x$ that maximizes the quantity inside the expectation. So we can move the maximization over $j$ inside as well:
$$= \max_{q \in Q} \min_{p \in P} \left( d_{KL}(p|q) + \mathbb{E}_p\left[\lambda x. \max_{j_x \in J(x)} \min_{k_x \in K(x)} d_{KL}(k_x|j_x)\right] \right)$$
And pack up the KL-divergence to get:
$$= \max_{q \in Q} \min_{p \in P} \left( d_{KL}(p|q) + \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))] \right)$$
And split up the min, using $\min_p (A(p) + B(p)) \ge \min_p A(p) + \min_p B(p)$, to get:
$$\ge \max_{q \in Q} \left( \min_{p \in P} d_{KL}(p|q) + \min_{p \in P} \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))] \right)$$
And then, since the second term doesn’t depend on $q$, we can pull it out of the max and get:
$$= \max_{q \in Q} \min_{p \in P} d_{KL}(p|q) + \min_{p \in P} \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))]$$
And pack up the KL-divergence to get:
$$= D_{KL}(P|Q) + \min_{p \in P} \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))]$$
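With finite sets for everything, both sides of Proposition 5 can be computed by brute force over the kernel selections $k \in \prod_x K(x)$ and $j \in \prod_x J(x)$ (a sketch with names of my own choosing, not from the text):

```python
import math
from itertools import product

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def joint(p, kern):
    """(p ⋉ k)(x, y) = p(x) * k_x(y), flattened over X x Y."""
    return [p[x] * kern[x][y] for x in range(len(p)) for y in range(len(kern[0]))]

P = [[0.2, 0.8], [0.5, 0.5]]
Q = [[0.4, 0.6]]
K = [[[0.1, 0.9]], [[0.6, 0.4], [0.8, 0.2]]]  # K(x): a set of dists over Y, per x
J = [[[0.3, 0.7]], [[0.5, 0.5]]]              # J(x): likewise

# LHS: max over (q, j-selection), min over (p, k-selection) of d_KL of joints
lhs = max(min(kl(joint(p, k), joint(q, j)) for p in P for k in product(*K))
          for q in Q for j in product(*J))
# RHS: D_KL(P|Q) + min_p E_p[D_KL(K(x)|J(x))]
d_PQ = max(min(kl(p, q) for p in P) for q in Q)
inner = [max(min(kl(kx, jx) for kx in K[x]) for jx in J[x]) for x in range(2)]
rhs = d_PQ + min(p[0] * inner[0] + p[1] * inner[1] for p in P)
assert lhs >= rhs - 1e-12
```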
Proposition 6: $D_{KL}(P_1 \times P_2 | Q_1 \times Q_2) = D_{KL}(P_1|Q_1) + D_{KL}(P_2|Q_2)$
Proof: To show this, we’ll go through the proof of Proposition 5 up to the first place where an inequality appears. The last step before the inequality was:
$$D_{KL}(P \ltimes K | Q \ltimes J) = \max_{q \in Q} \min_{p \in P} \left( d_{KL}(p|q) + \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))] \right)$$
Now, a direct product is just a semidirect product where the $K(x)$ are all the same infradistribution (and likewise the $J(x)$), so we have:
$$D_{KL}(P_1 \times P_2 | Q_1 \times Q_2) = \max_{q_1 \in Q_1} \min_{p_1 \in P_1} \left( d_{KL}(p_1|q_1) + \mathbb{E}_{p_1}[\lambda x. D_{KL}(P_2|Q_2)] \right)$$
Now, $D_{KL}(P_2|Q_2)$ is a constant, so we can pull it out of the expectation to get:
$$= \max_{q_1 \in Q_1} \min_{p_1 \in P_1} \left( d_{KL}(p_1|q_1) + D_{KL}(P_2|Q_2) \right)$$
$$= \max_{q_1 \in Q_1} \min_{p_1 \in P_1} d_{KL}(p_1|q_1) + D_{KL}(P_2|Q_2) = D_{KL}(P_1|Q_1) + D_{KL}(P_2|Q_2)$$
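A numeric check of the additivity. For simplicity this treats the direct product as the set of product distributions $\{p_1 \otimes p_2\}$; strictly, the direct product of credal sets also allows $x$-dependent choices from $P_2$, but by the reduction above both give the value $D_{KL}(P_1|Q_1) + D_{KL}(P_2|Q_2)$ (helper names mine):

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def infra_kl(P, Q):
    return max(min(kl(p, q) for p in P) for q in Q)

def outer(p1, p2):
    """The product distribution p1 x p2, as a flat vector over X1 x X2."""
    return [a * b for a in p1 for b in p2]

P1, Q1 = [[0.2, 0.8], [0.4, 0.6]], [[0.5, 0.5]]
P2, Q2 = [[0.7, 0.3]], [[0.6, 0.4], [0.45, 0.55]]
lhs = infra_kl([outer(p1, p2) for p1 in P1 for p2 in P2],
               [outer(q1, q2) for q1 in Q1 for q2 in Q2])
rhs = infra_kl(P1, Q1) + infra_kl(P2, Q2)
assert abs(lhs - rhs) < 1e-12
```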
Proposition 7: $D_{KL}(g_*(P)|g_*(Q)) \le D_{KL}(P|Q)$
Proof: For this, we’ll need the Disintegration Theorem (the classical version for probability distributions), and we’ll adapt some results from Proposition 5. Let’s get as far as we can before invoking these.
$$D_{KL}(g_*(P)|g_*(Q)) = \max_{q' \in g_*(Q)} \min_{p' \in g_*(P)} d_{KL}(p'|q') = \max_{q \in Q} \min_{p \in P} d_{KL}(g_*(p)|g_*(q))$$
Now, hypothetically, if we had
$$d_{KL}(g_*(p)|g_*(q)) \le d_{KL}(p|q)$$
then we could use that result to get
$$\le \max_{q \in Q} \min_{p \in P} d_{KL}(p|q) = D_{KL}(P|Q)$$
and we’d be done. So, our task is to show
$$d_{KL}(g_*(p)|g_*(q)) \le d_{KL}(p|q)$$
for any pair of probability distributions $p$ and $q$. Now, here’s what we’ll do. $p$ and $q$ are probability distributions over $X$, and $g_*(p)$ and $g_*(q)$ are probability distributions over $Y$. So, consider the joint distribution over $X \times Y$ given by selecting a point from $X$ according to the relevant distribution and applying $g$. By the classical version of the disintegration theorem, we can write this joint distribution either way: as the marginal distribution over $X$ followed by a semidirect product to $Y$, or as the marginal distribution over $Y$ followed by a semidirect product with some Markov kernel back to $X$. So, we have:
$$d_{KL}(p \ltimes g | q \ltimes g) = d_{KL}(g_*(p) \ltimes k | g_*(q) \ltimes j)$$
for some Markov kernels $k, j: Y \to X$. Why? Well, the joint distribution over $X \times Y$ is given by $p \ltimes g$ or $q \ltimes g$ respectively (you have a starting distribution, and $g$ takes an input in $X$ to an output in $Y$). Breaking it down the other way, we start with the marginal of each joint distribution on $Y$ (the pushforward w.r.t. $g$), and write the joint as a semidirect product going in the other direction. These are just two different ways of writing the same distributions, which is why the KL-divergence doesn’t change at all.
Now, it is also a fact that, for semidirect products in general (here we let $p, q, k, j$ be arbitrary and unconnected to the fixed ones we were looking at earlier), we have the chain rule:
$$d_{KL}(p \ltimes k | q \ltimes j) = d_{KL}(p|q) + \mathbb{E}_p[d_{KL}(k(x)|j(x))]$$
To see this, run through the proof of Proposition 5, since probability distributions are special cases of infradistributions. Right before the inequality, we had:
$$D_{KL}(P \ltimes K | Q \ltimes J) = \max_{q \in Q} \min_{p \in P} \left( d_{KL}(p|q) + \mathbb{E}_p[\lambda x. D_{KL}(K(x)|J(x))] \right)$$
But when we’re dealing with probability distributions, every max and min ranges over a single element, so we just have:
$$d_{KL}(p \ltimes k | q \ltimes j) = d_{KL}(p|q) + \mathbb{E}_p[d_{KL}(k(x)|j(x))]$$
Applying this, we have:
$$d_{KL}(p|q) + \mathbb{E}_p[d_{KL}(g(x)|g(x))] = d_{KL}(p \ltimes g | q \ltimes g)$$
$$= d_{KL}(g_*(p) \ltimes k | g_*(q) \ltimes j) = d_{KL}(g_*(p)|g_*(q)) + \mathbb{E}_{g_*(p)}[d_{KL}(k(y)|j(y))]$$
The first equality is our expansion of the semidirect product for probability distributions, the second is the two joint distributions being equal, and the third is, again, the semidirect-product expansion. Contracting the two ends of this chain, we have:
$$d_{KL}(p|q) + \mathbb{E}_p[d_{KL}(g(x)|g(x))] = d_{KL}(g_*(p)|g_*(q)) + \mathbb{E}_{g_*(p)}[d_{KL}(k(y)|j(y))]$$
Now, the KL-divergence between a distribution and itself is 0, so the expectation on the left-hand side is 0, and we have
$$d_{KL}(p|q) = d_{KL}(g_*(p)|g_*(q)) + \mathbb{E}_{g_*(p)}[d_{KL}(k(y)|j(y))] \ge d_{KL}(g_*(p)|g_*(q))$$
And bam, we have $d_{KL}(p|q) \ge d_{KL}(g_*(p)|g_*(q))$, which is exactly what we needed to carry the proof through.
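The resulting infra-level data-processing inequality is then easy to check numerically for a pushforward along a function $g: X \to Y$ (helper names mine):

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def infra_kl(P, Q):
    return max(min(kl(p, q) for p in P) for q in Q)

def push(p, g, m):
    """g_*(p): add up the mass that g sends to each point of Y = {0, ..., m-1}."""
    out = [0.0] * m
    for x, px in enumerate(p):
        out[g[x]] += px
    return out

g = [0, 0, 1]                    # a function X = {0,1,2} -> Y = {0,1}
P = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
Q = [[0.3, 0.3, 0.4]]
gP = [push(p, g, 2) for p in P]  # g_*(P): image of the credal set under pushforward
gQ = [push(q, g, 2) for q in Q]
assert infra_kl(gP, gQ) <= infra_kl(P, Q) + 1e-12
```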