Observation to potentially connect this to some math that people might be more familiar with: when $P$ and $Q$ are probability distributions, then $G_{x\sim P}[Q(x)] = e^{\mathbb{E}_{x\sim P}[\ln Q(x)]} = e^{-H(P,Q)}$, where $H$ is the cross-entropy.
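For a concrete sanity check of this identity, here is a minimal numpy sketch (the two distributions are arbitrary example values, not anything from the post): it estimates the geometric expectation as the geometric mean of $Q(x)$ over samples $x \sim P$, and compares that to $e^{-H(P,Q)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over the same three outcomes
# (example values; any valid probability vectors would do).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.2, 0.6])

# Estimate the geometric expectation G_{x~P}[Q(x)] by sampling:
# the geometric mean of Q(x) over draws x ~ P.
xs = rng.choice(len(P), size=1_000_000, p=P)
G_estimate = np.exp(np.mean(np.log(Q[xs])))

# Cross-entropy H(P, Q) = -E_{x~P}[ln Q(x)], computed exactly.
H_PQ = -np.sum(P * np.log(Q))

print(G_estimate)     # ~0.249, up to sampling noise
print(np.exp(-H_PQ))  # 0.249..., matching e^{-H(P,Q)}
```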
Note that the cross-entropy (and thus $G_{x\sim P}[Q(x)]$) is dependent on meaningless details of which events you consider the same vs. different, but $e^{H(P,P)-H(P,Q)} = G_{x\sim P}[Q(x)]/G_{x\sim P}[P(x)] = G_{x\sim P}[Q(x)/P(x)]$ is not (as much), and when maximizing with respect to $Q$, this is the same maximization.
(I am just pointing out that KL divergence is a more natural concept than cross-entropy.)
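Indeed, the exponent $H(P,P) - H(P,Q)$ is exactly $-D_{\mathrm{KL}}(P\,\|\,Q)$, so the normalized quantity is $e^{-D_{\mathrm{KL}}(P\,\|\,Q)}$. A small sketch checking that all three expressions agree (again with arbitrary example distributions):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.2, 0.6])

def G(P, weights):
    """Geometric expectation of weights(x) under x ~ P (exact, discrete)."""
    return np.exp(np.sum(P * np.log(weights)))

kl = np.sum(P * np.log(P / Q))  # D_KL(P || Q)

print(G(P, Q) / G(P, P))  # G[Q] / G[P]
print(G(P, Q / P))        # G[Q/P] -- same value
print(np.exp(-kl))        # e^{-KL}; all three agree (~0.698)
```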
I think $e^{-H(P)}$ might also have a natural interpretation along the lines of "the probability that two consecutive samples from $P$ are equal". This holds exactly for the uniform distribution, but only approximately for the Bernoulli distribution, so it is not a perfect heuristic.
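A quick numerical check of this heuristic (a sketch; the Bernoulli parameter 0.9 is just an example): compare $e^{-H(P)}$ with the collision probability $\sum_x P(x)^2$.

```python
import numpy as np

def geo_prob(P):
    """e^{-H(P)}: the geometric expectation of P(x) under x ~ P."""
    return np.exp(np.sum(P * np.log(P)))

def collision_prob(P):
    """Probability two i.i.d. samples from P are equal: sum_x P(x)^2."""
    return np.sum(P**2)

uniform = np.full(4, 0.25)
bernoulli = np.array([0.9, 0.1])  # example Bernoulli(0.9)

print(geo_prob(uniform), collision_prob(uniform))      # 0.25 vs 0.25: exact
print(geo_prob(bernoulli), collision_prob(bernoulli))  # ~0.72 vs 0.82: only approximate
```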
The middle piece here should be $G_{x\sim P}[Q(x)]/G_{x\sim P}[P(x)]$, right?
Anyway KL-divergence is based.
Yeah, edited.