Related reading linking mutual information to best possible classifier:
https://arxiv.org/pdf/1801.04062.pdf
This one talks about estimating KL divergence and mutual information using neural networks, but I'm specifically linking it to show y'all Theorem 1:
Theorem 1 (Donsker-Varadhan representation). The KL divergence admits the following dual representation:
D_KL(P || Q) = sup_{T: Ω → ℝ} E_P[T] − log(E_Q[e^T])
Since I(X;Y) = D_KL(P_XY || P_X ⊗ P_Y), taking P to be the joint and Q the product of marginals links the mutual information to the best possible regression: MI is the supremum of this objective over all functions T. But I haven't figured out exactly how to parse / interpret this.
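To sanity-check Theorem 1 numerically, here's a minimal numpy/scipy sketch (my own toy example, not from the paper): for two 1-D Gaussians where KL(P||Q) has a closed form, the DV objective evaluated at the optimal critic T*(x) = log dP/dQ(x) recovers the exact KL, while a weaker T gives a strictly smaller value, so the sup really is attained at the log density ratio.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup: P = N(1, 1), Q = N(0, 1).
mu_p, mu_q, sigma = 1.0, 0.0, 1.0

# Closed-form KL(P || Q) for equal-variance Gaussians: (mu_p - mu_q)^2 / (2 sigma^2).
kl_exact = (mu_p - mu_q) ** 2 / (2 * sigma**2)

n = 200_000
x_p = rng.normal(mu_p, sigma, n)  # samples from P
x_q = rng.normal(mu_q, sigma, n)  # samples from Q

def dv_bound(T):
    """Monte Carlo estimate of the DV objective E_P[T] - log E_Q[e^T]."""
    return T(x_p).mean() - np.log(np.exp(T(x_q)).mean())

# Optimal critic: T*(x) = log dP/dQ(x); the DV objective then equals KL(P||Q),
# since E_P[log dP/dQ] = KL and log E_Q[dP/dQ] = log 1 = 0.
T_star = lambda x: norm.logpdf(x, mu_p, sigma) - norm.logpdf(x, mu_q, sigma)
# Any other critic lower-bounds the KL (here: a scaled-down version of T*).
T_weak = lambda x: 0.5 * T_star(x)

print(f"exact KL           : {kl_exact:.4f}")
print(f"DV bound at T = T* : {dv_bound(T_star):.4f}")  # ~ matches exact KL
print(f"DV bound, weaker T : {dv_bound(T_weak):.4f}")  # strictly below KL
```

The MINE estimator in the paper is exactly this objective with P = joint, Q = product of marginals, and T parameterized by a neural network that's trained to maximize the bound.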