Another intuition I have often found useful: KL divergence behaves more like the square of a metric than like a metric.
The clearest indicator of this is that KL divergence satisfies a kind of Pythagorean theorem, established by Csiszár (1975) (see https://www.jstor.org/stable/2959270#metadata_info_tab_contents). The intuition is exactly the same as in the Euclidean case: if we project a point $A$ onto a convex set $S$ (say the projection is $B$), and $C$ is another point of $S$, then convexity forces the angle of the triangle $ABC$ at $B$ to be at least 90 degrees, and the law of cosines turns this into the Pythagorean inequality $|A-C|^2 \ge |A-B|^2 + |B-C|^2$. The same holds if we project with respect to KL divergence (the "I-projection", i.e. $B$ minimizes $D_{\mathrm{KL}}(\cdot \,\|\, A)$ over $S$), and we end up with $D_{\mathrm{KL}}(C \,\|\, A) \ge D_{\mathrm{KL}}(B \,\|\, A) + D_{\mathrm{KL}}(C \,\|\, B)$.
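To make this concrete, here is a minimal numerical sketch (a toy example of my own, not taken from Csiszár's paper): project a distribution $A$ on three outcomes onto the convex set of distributions putting at least 0.6 mass on the first outcome, and check the inequality against some other point $C$ of that set.

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions with full support (natural log)
    return float(np.sum(p * np.log(p / q)))

# Toy setup: A is a distribution on three outcomes, and
# S = {p : p[0] >= 0.6} is a convex set that does not contain A.
A = np.array([0.2, 0.5, 0.3])

# I-projection B = argmin_{p in S} D_KL(p || A): since A[0] < 0.6 the minimizer
# sits on the boundary p[0] = 0.6, with the remaining mass 0.4 spread over the
# other outcomes proportionally to A.
B = np.array([0.6, 0.4 * 0.5 / 0.8, 0.4 * 0.3 / 0.8])

# Any other point C in S.
C = np.array([0.7, 0.1, 0.2])

lhs = kl(C, A)               # D_KL(C || A)
rhs = kl(B, A) + kl(C, B)    # D_KL(B || A) + D_KL(C || B)
print(lhs, rhs, lhs >= rhs)  # roughly 0.635, 0.456, True
```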
This has implications if you think about things like sample efficiency: instead of the usual $1/\sqrt{n}$ rate, convergence rates measured in KL divergence typically behave like $1/n$.
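To see where the $1/n$ comes from in the simplest case, here is a toy simulation of my own: for a coin with bias $p$, the estimation error $|\hat p - p|$ decays like $1/\sqrt{n}$, but to second order $D_{\mathrm{KL}}(\hat p \,\|\, p) \approx (\hat p - p)^2 / (2p(1-p))$, so on average it decays like $1/(2n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # true bias of the coin

def kl_bern(a, b):
    # D_KL(Bernoulli(a) || Bernoulli(b)), with the convention 0 * log 0 = 0
    out = 0.0
    if a > 0:
        out += a * np.log(a / b)
    if a < 1:
        out += (1 - a) * np.log((1 - a) / (1 - b))
    return out

for n in [100, 1000, 10000]:
    # average KL between the empirical coin and the true coin over many runs
    kls = [kl_bern(rng.binomial(n, p) / n, p) for _ in range(2000)]
    print(n, np.mean(kls), 1 / (2 * n))  # the mean KL tracks 1/(2n)
```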
This is also reflected in how KL divergence relates to other distances between probability measures, like total variation or Wasserstein distance. The most prominent example in this regard is Pinsker's inequality, which states that the total variation distance between two measures is bounded by a constant times the square root of the KL divergence between them.
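As a quick sanity check (toy numbers of my own choosing), here is Pinsker's inequality in the form $\mathrm{TV}(P,Q) \le \sqrt{D_{\mathrm{KL}}(P\,\|\,Q)/2}$, where TV is half the $L^1$ distance:

```python
import numpy as np

# Toy check of Pinsker's inequality, TV(P, Q) <= sqrt(D_KL(P || Q) / 2),
# with TV the total variation distance (half the L1 distance) and KL in nats.
P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.3, 0.4])

tv = 0.5 * np.abs(P - Q).sum()
kl = float(np.sum(P * np.log(P / Q)))
print(tv, np.sqrt(kl / 2), tv <= np.sqrt(kl / 2))  # roughly 0.2, 0.242, True
```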
This intuition, that the KL is a metric squared, is indeed important for understanding the KL divergence. It's a property that all divergences have in common. Divergences can be thought of as generalizations of the squared Euclidean distance where you replace the quadratic (which is in some sense the Platonic convex function) with a convex function of your choice.
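One standard way to make this precise (I'm assuming this is the framing meant here) is via Bregman divergences, $D_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla\phi(q),\, p - q\rangle$: taking $\phi$ to be the squared norm gives back the squared Euclidean distance, and taking $\phi$ to be negative entropy gives the KL divergence on the probability simplex. A minimal sketch:

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    # Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>
    return phi(p) - phi(q) - float(np.dot(grad_phi(q), p - q))

# phi = squared Euclidean norm  ->  D_phi is the squared Euclidean distance
sq = lambda x: float(np.dot(x, x))
grad_sq = lambda x: 2 * x

# phi = negative entropy  ->  D_phi is the KL divergence (on the probability simplex)
negent = lambda x: float(np.sum(x * np.log(x)))
grad_negent = lambda x: np.log(x) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman(sq, grad_sq, p, q), float(np.sum((p - q) ** 2)))               # both 0.14
print(bregman(negent, grad_negent, p, q), float(np.sum(p * np.log(p / q))))  # both ~0.233
```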
This intuition is also important for understanding Talagrand's T2 inequality, which says that, under certain conditions such as strong log-concavity of the reference measure q, the squared Wasserstein-2 distance between the two probability measures p and q can be upper-bounded by a constant times their KL divergence; Wasserstein-2 is the analogue of the Euclidean metric lifted to the space of probability measures, so its square again plays the role of a metric squared.
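For concreteness, here is a toy check of my own for the Gaussian case, where the reference measure $q = \mathcal N(0,1)$ is 1-strongly log-concave and T2 reads $W_2^2(p, q) \le 2\, D_{\mathrm{KL}}(p \,\|\, q)$; it uses the standard closed forms for $W_2$ and KL between one-dimensional Gaussians.

```python
import numpy as np

# Toy check of T2 against the standard Gaussian q = N(0, 1) (1-strongly log-concave),
# using the closed forms for one-dimensional Gaussians p = N(m, s^2):
#   W_2^2(p, q)  = m^2 + (s - 1)^2
#   D_KL(p || q) = -log(s) + (s^2 + m^2 - 1) / 2
# T2 then reads W_2^2(p, q) <= 2 * D_KL(p || q).
for m, s in [(0.5, 0.8), (2.0, 1.5), (-1.0, 0.3)]:
    w2_sq = m**2 + (s - 1)**2
    two_kl = 2 * (-np.log(s) + (s**2 + m**2 - 1) / 2)
    print(w2_sq, two_kl, w2_sq <= two_kl)
```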