Explaining Wasserstein distance. I haven’t seen the following explanation anywhere, and I think it’s better than the rest I’ve seen.
The Wasserstein distance tells you the minimal cost to “move” one probability distribution μ into another ν. It has a lot of nice properties.[1] Here’s the chunk of math (don’t worry if you don’t follow it):
The Wasserstein 1-distance between two probability measures μ and ν is

$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x,y) \sim \gamma}\,[d(x, y)],$$

where Γ(μ,ν) is the set of all couplings of μ and ν.
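As a quick numerical sanity check of this definition, SciPy's `scipy.stats.wasserstein_distance` computes the 1-distance between two empirical distributions on ℝ. The particular Gaussians and sample size below are my own illustrative choices:

```python
# Sketch: W_1 between samples from N(0, 1) and N(2, 1). For two
# Gaussians with equal variance, W_1 is just the gap between the means.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
nu_samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

w1 = wasserstein_distance(mu_samples, nu_samples)
print(w1)  # ≈ 2.0
```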
What’s a “coupling”? It’s a joint probability distribution γ over (x,y) such that its two marginal distributions equal X and Y (i.e. μ and ν). However, I like to call these transport plans. Each plan specifies a way to transport a distribution X into another distribution Y:
(EDIT: The y=x line should be flipped.)
Now consider a given point x in X’s support, say the one with the dotted line below it. x’s density must be “reallocated” into Y’s distribution. That reallocation is specified by the conditional distribution γ(Y∣X=x), as shown by the vertical dotted line. Marginalizing over X, γ transports all of X’s density and turns it into Y! (This is why we required the marginal constraints.)
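The marginal constraints are easiest to see in the discrete case. Here's a minimal sketch (the distributions and the particular plan are my own illustrative choices): the plan is a matrix `gamma[i, j] = P(X = xᵢ, Y = yⱼ)` whose rows sum to X's distribution and whose columns sum to Y's.

```python
# A discrete coupling: rows marginalize to X, columns marginalize to Y.
import numpy as np

p_x = np.array([0.5, 0.5])     # distribution of X over {x_0, x_1}
p_y = np.array([0.3, 0.7])     # distribution of Y over {y_0, y_1}

gamma = np.array([[0.3, 0.2],  # one valid transport plan
                  [0.0, 0.5]])

assert np.allclose(gamma.sum(axis=1), p_x)  # marginal over y recovers X
assert np.allclose(gamma.sum(axis=0), p_y)  # marginal over x recovers Y

# How x_0's mass is reallocated: the conditional gamma(Y | X = x_0).
cond = gamma[0] / gamma[0].sum()
print(cond)  # → [0.6 0.4]
```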
Then the Wasserstein 1-distance is simple in this case, where X,Y are distributions on ℝ. The cost of a plan γ is simply the expected absolute distance from the y=x line!
$$\mathbb{E}_{(x,y) \sim \gamma}\,[d(x, y)] = \mathbb{E}_{x \sim X}\, \mathbb{E}_{y \sim \gamma(Y \mid X = x)}\, |x - y|.$$
Then the Wasserstein just finds the infimum over all possible transport plans! Spiritually, the Wasserstein 1-distance[2] tells you the cost of the most efficient way to take each point in X and redistribute it into Y. Just evaluate each transport plan by looking at the expected deviation from the identity line y=x.
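For discrete distributions, that infimum over transport plans is a small linear program: minimize the expected cost over the entries of γ, subject to the marginal constraints. A sketch using `scipy.optimize.linprog` (the support points and weights are arbitrary illustrative choices):

```python
# Discrete W_1 as a linear program over transport plans gamma[i, j].
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0])       # support of X
y = np.array([2.0, 3.0])       # support of Y (X shifted right by 2)
p_x = np.array([0.5, 0.5])
p_y = np.array([0.5, 0.5])

# Cost of moving mass from x_i to y_j, flattened to match gamma.ravel().
cost = np.abs(x[:, None] - y[None, :]).ravel()

# Equality constraints: row sums of gamma equal p_x, column sums equal p_y.
n, m = len(x), len(y)
A_rows = np.kron(np.eye(n), np.ones(m))   # picks out each row of gamma
A_cols = np.kron(np.ones(n), np.eye(m))   # picks out each column of gamma
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([p_x, p_y])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # → 2.0: the best plan moves every point straight right by 2
```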
Exercise: For X,Y on ℝ, where Y is X but translated to the right by t, use this picture to explain why the 1-distance equals t.
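The exercise can also be checked numerically: translating a sample by t shifts the empirical 1-distance by exactly t, since the optimal plan moves every point straight to its translate. (The exponential sample below is an arbitrary choice.)

```python
# Empirical check: W_1(X, X + t) = t for any distribution X on the reals.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
samples = rng.exponential(size=10_000)
t = 3.0

w1 = wasserstein_distance(samples, samples + t)
print(w1)  # → 3.0
```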
For onlookers, I strongly recommend Gabriel Peyré and Marco Cuturi’s online book Computational Optimal Transport. I also think this is a case where considering discrete distributions helps build intuition.
[1] The distance is motivated by section 1 of “Optimal Transport and Wasserstein Distance.”
[2] This explanation also works for the p-distance for p≠1; it just makes the math a little more cluttered.