IMO the main concept to deeply understand when studying information theory is the notion of information content (also called self-information or Shannon information). Most other things seem to be applications of or expansions on this concept. For example, entropy is just the expected information content when sampling from a distribution. Mutual information is the information content shared between two random variables. KL divergence measures how much extra information you pay, on average, when your encoding is built for a different distribution than the one you're actually sampling from. Information gain is the reduction in expected information content (entropy) after you drew a sample, compared to before.
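To make this concrete, here is a minimal Python sketch of these definitions for discrete distributions; the distributions and function names are just illustrative:

```python
import math

def self_information(p: float) -> float:
    """Information content of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Expected self-information when sampling from the distribution."""
    return sum(p * self_information(p) for p in dist if p > 0)

def kl_divergence(p: list[float], q: list[float]) -> float:
    """Expected extra bits paid for encoding samples from p
    with a code optimized for q instead."""
    return sum(pi * (math.log2(pi) - math.log2(qi))
               for pi, qi in zip(p, q) if pi > 0)

# A fair coin carries 1 bit per flip; a biased coin carries less.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair))                # 1.0
print(entropy(biased))              # ~0.469
print(kl_divergence(biased, fair))  # ~0.531 extra bits per flip from
                                    # using the fair-coin code on biased data
```

Note how everything bottoms out in self_information: entropy is its expectation, and KL divergence is an expected difference of information contents under two codes.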
For this, I would recommend this essay I wrote. I would also recommend Terence Tao's post on internet anonymity, or, if you've seen Death Note, Gwern's post on the mistakes of Light. Also this video on KL divergence, and this video by the Intelligent Systems Lab.