This is nitpicky, but I believe the reasoning for 3 is mistaken, mainly because of the counterintuitive claim that minimising the cross-entropy over $\theta$ is equivalent to maximising the KL divergence.
Writing the (normalised) log-likelihood as
$$\frac{1}{N}\log\ell = \frac{1}{N}\sum_{i=1}^{N}\ln Q_\theta(x_i),$$
the RHS tends (by the LLN) to $\mathbb{E}_P[\ln Q_\theta(X)] = -H(P, Q_\theta)$, i.e. the *negative* cross-entropy, rather than the cross-entropy itself. Then, since $H(P) = H(P, Q) - D_{\mathrm{KL}}(P \,\|\, Q)$, with $P$ the fixed (true) distribution, $Q = Q_\theta$ varying with $\theta$, and $H(P)$ constant in $\theta$, MLE amounts to maximising the negative cross-entropy, i.e. minimising the cross-entropy, which in turn minimises the KL divergence.
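To make this concrete, here is a small numerical sketch (my own illustration, not part of the original argument): fitting the mean of a unit-variance Gaussian family $Q_\theta = \mathcal{N}(\theta, 1)$ to samples from a hypothetical true distribution $P = \mathcal{N}(1.5, 1)$, the $\theta$ maximising the average log-likelihood essentially coincides with the $\theta$ minimising $D_{\mathrm{KL}}(P\,\|\,Q_\theta)$.

```python
import numpy as np

# Illustrative check (my own, not from the original answer): the theta that
# maximises the average log-likelihood agrees with the theta that minimises
# the KL divergence, since both differ from the cross-entropy only by H(P).

rng = np.random.default_rng(0)
true_mu, sigma = 1.5, 1.0
x = rng.normal(true_mu, sigma, size=100_000)   # samples from P

thetas = np.linspace(0.0, 3.0, 301)

# (1/N) * sum_i ln Q_theta(x_i): by the LLN this approaches
# E_P[ln Q_theta(X)] = -H(P, Q_theta), the negative cross-entropy.
avg_loglik = np.array([
    np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - t) ** 2 / (2 * sigma**2))
    for t in thetas
])

# Closed-form KL divergence between two unit-variance Gaussians:
# D_KL(P || Q_theta) = (true_mu - theta)^2 / 2.
kl = (true_mu - thetas) ** 2 / 2

print("theta maximising average log-likelihood:", thetas[np.argmax(avg_loglik)])
print("theta minimising KL divergence:        ", thetas[np.argmin(kl)])
# Both land near true_mu = 1.5, illustrating that maximising the likelihood
# (equivalently, minimising the cross-entropy) minimises the KL divergence.
```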