Note that the cross entropy (and thus G_{x∼P}[Q(x)]) depends on meaningless details of which events you consider the same vs. different, but e^{H(P,P) − H(P,Q)} = G_{x∼P}[Q(x)] / G_{x∼P}[P(x)] = G_{x∼P}[Q(x)/P(x)] does not (as much), and when maximizing with respect to Q, this is the same maximization.
(I am just pointing out that KL divergence is a more natural concept than cross entropy.)
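A minimal numeric sketch of the refinement-dependence being pointed at (made-up distributions, in Python): splitting one outcome into two equally likely sub-outcomes shifts the cross entropy H(P,Q) by a constant (p₃·log 2 here), but leaves the KL divergence H(P,Q) − H(P,P) unchanged.

```python
import math

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)) = H(P,Q) - H(P,P)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Refine the third outcome into two equally likely sub-outcomes
# (a "meaningless detail" of how events are individuated).
p_fine = [0.5, 0.3, 0.1, 0.1]
q_fine = [0.4, 0.4, 0.1, 0.1]

# Cross entropy changes under refinement (by p3 * log 2)...
print(cross_entropy(p, q), cross_entropy(p_fine, q_fine))
# ...but KL divergence does not.
print(kl(p, q), kl(p_fine, q_fine))
```

Equivalently, G_{x∼P}[Q(x)/P(x)] = e^{−D_KL(P‖Q)} is refinement-invariant for the same reason: the ratio Q(x)/P(x) is unchanged when an event is split proportionally.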
The middle piece here should be G_{x∼P}[Q(x)] / G_{x∼P}[P(x)], right?
Anyway KL-divergence is based.
Yeah, edited.