This is nitpicky, but I believe the reasoning for 3 is mistaken, mainly because of the counterintuitive claim that minimising the cross-entropy over $\theta$ is equivalent to maximising the KL divergence.
Writing the (normalised) log-likelihood as
$$\frac{1}{N}\log\ell = \frac{1}{N}\sum_{i=1}^{N}\ln Q_\theta(x_i),$$
the RHS tends (by the LLN) to $\mathbb{E}_P[\ln Q_\theta(X)] = -H(P, Q_\theta)$, i.e. the *negative* cross-entropy, rather than the cross-entropy itself. Then, since $H(P) = H(P, Q) - D_{\mathrm{KL}}(P \,\|\, Q)$, with $P$ the fixed (true) distribution, $Q = Q_\theta$ varying with $\theta$, and $H(P)$ constant in $\theta$, MLE amounts to maximising the negative cross-entropy, i.e. minimising the cross-entropy, which in turn minimises the KL divergence.
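To make this concrete, here is a small numerical sketch (my own illustration, not part of the original argument): fitting the mean of a unit-variance Gaussian family $Q_\theta = \mathcal{N}(\theta, 1)$ to samples from a hypothetical true distribution $P = \mathcal{N}(1.5, 1)$, the $\theta$ maximising the average log-likelihood essentially coincides with the $\theta$ minimising $D_{\mathrm{KL}}(P\,\|\,Q_\theta)$.

```python
import numpy as np

# Illustrative check (my own, not from the original answer): the theta that
# maximises the average log-likelihood agrees with the theta that minimises
# the KL divergence, since both differ from the cross-entropy only by H(P).

rng = np.random.default_rng(0)
true_mu, sigma = 1.5, 1.0
x = rng.normal(true_mu, sigma, size=100_000)   # samples from P

thetas = np.linspace(0.0, 3.0, 301)

# (1/N) * sum_i ln Q_theta(x_i): by the LLN this approaches
# E_P[ln Q_theta(X)] = -H(P, Q_theta), the negative cross-entropy.
avg_loglik = np.array([
    np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - t) ** 2 / (2 * sigma**2))
    for t in thetas
])

# Closed-form KL divergence between two unit-variance Gaussians:
# D_KL(P || Q_theta) = (true_mu - theta)^2 / 2.
kl = (true_mu - thetas) ** 2 / 2

print("theta maximising average log-likelihood:", thetas[np.argmax(avg_loglik)])
print("theta minimising KL divergence:        ", thetas[np.argmin(kl)])
# Both land near true_mu = 1.5, illustrating that maximising the likelihood
# (equivalently, minimising the cross-entropy) minimises the KL divergence.
```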