Related reading linking mutual information to best possible classifier:
https://arxiv.org/pdf/1801.04062.pdf
This one talks about estimating KL divergence and mutual information using neural networks, but I'm specifically linking it to show y'all Theorem 1:
Theorem 1 (Donsker-Varadhan representation). The KL divergence admits the following dual representation:
D_KL(P || Q) = sup_{T: Ω → ℝ} E_P[T] − log(E_Q[e^T])
Since I(X;Y) = D_KL(P_XY || P_X ⊗ P_Y), taking P to be the joint and Q the product of marginals links the mutual information to the best possible regression: MI is the supremum of this objective over all functions T. But I haven't figured out exactly how to parse / interpret this.
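To sanity-check Theorem 1 numerically, here's a minimal numpy/scipy sketch (my own toy example, not from the paper): for two 1-D Gaussians where KL(P||Q) has a closed form, the DV objective evaluated at the optimal critic T*(x) = log dP/dQ(x) recovers the exact KL, while a weaker T gives a strictly smaller value, so the sup really is attained at the log density ratio.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup: P = N(1, 1), Q = N(0, 1).
mu_p, mu_q, sigma = 1.0, 0.0, 1.0

# Closed-form KL(P || Q) for equal-variance Gaussians: (mu_p - mu_q)^2 / (2 sigma^2).
kl_exact = (mu_p - mu_q) ** 2 / (2 * sigma**2)

n = 200_000
x_p = rng.normal(mu_p, sigma, n)  # samples from P
x_q = rng.normal(mu_q, sigma, n)  # samples from Q

def dv_bound(T):
    """Monte Carlo estimate of the DV objective E_P[T] - log E_Q[e^T]."""
    return T(x_p).mean() - np.log(np.exp(T(x_q)).mean())

# Optimal critic: T*(x) = log dP/dQ(x); the DV objective then equals KL(P||Q),
# since E_P[log dP/dQ] = KL and log E_Q[dP/dQ] = log 1 = 0.
T_star = lambda x: norm.logpdf(x, mu_p, sigma) - norm.logpdf(x, mu_q, sigma)
# Any other critic lower-bounds the KL (here: a scaled-down version of T*).
T_weak = lambda x: 0.5 * T_star(x)

print(f"exact KL           : {kl_exact:.4f}")
print(f"DV bound at T = T* : {dv_bound(T_star):.4f}")  # ~ matches exact KL
print(f"DV bound, weaker T : {dv_bound(T_weak):.4f}")  # strictly below KL
```

The MINE estimator in the paper is exactly this objective with P = joint, Q = product of marginals, and T parameterized by a neural network that's trained to maximize the bound.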