By the law of large numbers, $\frac{1}{N}\sum_{i=1}^{N}\ln Q_\theta(x_i) \to \sum_x P(x)\ln Q_\theta(x)$ almost surely. This is the cross entropy of $P$ and $Q_\theta$. Also note that if we subtract this from the entropy of $P$, we get $D_{KL}(P\|Q_\theta)$. So minimising the cross entropy over $\theta$ is equivalent to maximising $D_{KL}(P\|Q_\theta)$.
I think the cross entropy of $P$ and $Q_\theta$ is actually $H(P, Q_\theta) = -\sum_x P(x)\ln Q_\theta(x)$ (note the negative sign). The entropy of $P$ is $H(P) = -\sum_x P(x)\ln P(x)$. Since
$$D_{KL}(P\|Q_\theta) = \sum_x P(x)\big(\ln P(x) - \ln Q_\theta(x)\big) = \sum_x P(x)\ln P(x) - \sum_x P(x)\ln Q_\theta(x) = -H(P) + H(P, Q_\theta),$$
the KL divergence is actually the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $\theta$ will minimise (not maximise) the KL divergence.
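In case it helps, here's a quick numerical sanity check of that identity. The discrete $P$ and the softmax-parametrised family $Q_\theta$ below are toys I made up purely for illustration, not anything from the post:

```python
import numpy as np

# Toy true distribution P over 3 outcomes (chosen arbitrarily for illustration).
P = np.array([0.2, 0.5, 0.3])

def Q(theta):
    # A simple softmax-parametrised family Q_theta over the same 3 outcomes.
    z = np.exp(theta)
    return z / z.sum()

def cross_entropy(P, Q):
    return -np.sum(P * np.log(Q))

def entropy(P):
    return -np.sum(P * np.log(P))

def kl(P, Q):
    return np.sum(P * (np.log(P) - np.log(Q)))

theta = np.array([0.1, -0.3, 0.7])   # arbitrary parameter value
Qt = Q(theta)

# KL divergence equals cross entropy minus entropy (not the other way around);
# the two printed values agree up to floating-point rounding.
print(kl(P, Qt), cross_entropy(P, Qt) - entropy(P))

# Since H(P) does not depend on theta, minimising H(P, Q_theta) over theta
# also minimises D_KL(P || Q_theta): on a crude grid the minimisers coincide.
grid = [np.array([a, b, 0.0]) for a in np.linspace(-2, 2, 41)
                              for b in np.linspace(-2, 2, 41)]
ce = [cross_entropy(P, Q(t)) for t in grid]
dk = [kl(P, Q(t)) for t in grid]
print(np.argmin(ce), np.argmin(dk))  # same grid index for both criteria
```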
I believe the next paragraph is still correct: the maximum likelihood estimator $\theta^*$ is the parameter which maximises $L(\hat{P}_n; Q_\theta)$, which minimises the cross entropy, which minimises the KL divergence.
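And here is a small sketch of that chain for a Bernoulli family (again just an illustrative toy I picked, with $p = 0.3$ and a grid search standing in for the exact minimisation): the maximum-likelihood $\theta$ computed from samples lands close to the $\theta$ that minimises the cross entropy, which here is the true $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                  # true Bernoulli parameter (illustrative)
x = rng.binomial(1, p, size=100_000)     # i.i.d. samples from P

# Maximum-likelihood estimate: the theta maximising (1/N) sum_i ln Q_theta(x_i),
# which for a Bernoulli is just the sample mean.
theta_mle = x.mean()

# Cross entropy H(P, Q_theta) for a Bernoulli, minimised over a grid of theta.
thetas = np.linspace(0.001, 0.999, 999)
H = -(p * np.log(thetas) + (1 - p) * np.log(1 - thetas))
theta_ce = thetas[np.argmin(H)]

print(theta_mle, theta_ce)   # both are close to p = 0.3
```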
Apologies if any of what I’ve said above is incorrect, I’m not an expert on this.