Explaining Wasserstein distance. I haven’t seen the following explanation anywhere, and I think it’s better than the rest I’ve seen.
The Wasserstein distance tells you the minimal cost to “move” one probability distribution μ into another ν. It has a lot of nice properties.[1] Here’s the chunk of math (don’t worry if you don’t follow it):
The Wasserstein 1-distance between two probability measures μ and ν is

$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x,y) \sim \gamma}\,[d(x, y)],$$

where Γ(μ,ν) is the set of all couplings of μ and ν.
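As a quick numerical sanity check of this definition, SciPy's `scipy.stats.wasserstein_distance` computes the 1-distance between two empirical distributions on ℝ. The particular Gaussians and sample size below are my own illustrative choices:

```python
# Sketch: W_1 between samples from N(0, 1) and N(2, 1). For two
# Gaussians with equal variance, W_1 is just the gap between the means.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
nu_samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

w1 = wasserstein_distance(mu_samples, nu_samples)
print(w1)  # ≈ 2.0
```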
What’s a “coupling”? It’s a joint probability distribution γ over (x,y) such that its two marginal distributions equal X and Y (i.e. μ and ν). However, I like to call these transport plans. Each plan specifies a way to transport a distribution X into another distribution Y:
(EDIT: The y=x line should be flipped.)
Now consider a given point x in X’s support, say the one with the dotted line below it. x’s density must be “reallocated” into Y’s distribution. That reallocation is specified by the conditional distribution γ(Y∣X=x), as shown by the vertical dotted line. Marginalizing over X, γ transports all of X’s density and turns it into Y! (This is why we required the marginal constraints.)
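The marginal constraints are easiest to see in the discrete case. Here's a minimal sketch (the distributions and the particular plan are my own illustrative choices): the plan is a matrix `gamma[i, j] = P(X = xᵢ, Y = yⱼ)` whose rows sum to X's distribution and whose columns sum to Y's.

```python
# A discrete coupling: rows marginalize to X, columns marginalize to Y.
import numpy as np

p_x = np.array([0.5, 0.5])     # distribution of X over {x_0, x_1}
p_y = np.array([0.3, 0.7])     # distribution of Y over {y_0, y_1}

gamma = np.array([[0.3, 0.2],  # one valid transport plan
                  [0.0, 0.5]])

assert np.allclose(gamma.sum(axis=1), p_x)  # marginal over y recovers X
assert np.allclose(gamma.sum(axis=0), p_y)  # marginal over x recovers Y

# How x_0's mass is reallocated: the conditional gamma(Y | X = x_0).
cond = gamma[0] / gamma[0].sum()
print(cond)  # → [0.6 0.4]
```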
Then the Wasserstein 1-distance is simple in this case, where X,Y are distributions on ℝ. The cost of a plan γ is simply the expected absolute distance from the y=x line!
$$\mathbb{E}_{(x,y) \sim \gamma}\,[d(x, y)] = \mathbb{E}_{x \sim X}\, \mathbb{E}_{y \sim \gamma(Y \mid X = x)}\, |x - y|.$$
Then the Wasserstein just finds the infimum over all possible transport plans! Spiritually, the Wasserstein 1-distance[2] tells you the cost of the most efficient way to take each point in X and redistribute it into Y. Just evaluate each transport plan by looking at the expected deviation from the identity line y=x.
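For discrete distributions, that infimum over transport plans is a small linear program: minimize the expected cost over the entries of γ, subject to the marginal constraints. A sketch using `scipy.optimize.linprog` (the support points and weights are arbitrary illustrative choices):

```python
# Discrete W_1 as a linear program over transport plans gamma[i, j].
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0])       # support of X
y = np.array([2.0, 3.0])       # support of Y (X shifted right by 2)
p_x = np.array([0.5, 0.5])
p_y = np.array([0.5, 0.5])

# Cost of moving mass from x_i to y_j, flattened to match gamma.ravel().
cost = np.abs(x[:, None] - y[None, :]).ravel()

# Equality constraints: row sums of gamma equal p_x, column sums equal p_y.
n, m = len(x), len(y)
A_rows = np.kron(np.eye(n), np.ones(m))   # picks out each row of gamma
A_cols = np.kron(np.ones(n), np.eye(m))   # picks out each column of gamma
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([p_x, p_y])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # → 2.0: the best plan moves every point straight right by 2
```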
Exercise: For X,Y on ℝ, where Y is X but translated to the right by t, use this picture to explain why the 1-distance equals t.
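The exercise can also be checked numerically: translating a sample by t shifts the empirical 1-distance by exactly t, since the optimal plan moves every point straight to its translate. (The exponential sample below is an arbitrary choice.)

```python
# Empirical check: W_1(X, X + t) = t for any distribution X on the reals.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
samples = rng.exponential(size=10_000)
t = 3.0

w1 = wasserstein_distance(samples, samples + t)
print(w1)  # → 3.0
```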
For onlookers, I strongly recommend Gabriel Peyré and Marco Cuturi’s online book Computational Optimal Transport. I also think this is a case where considering discrete distributions helps build intuition.
[1] The distance is motivated by section 1 of “Optimal Transport and Wasserstein Distance.”
[2] This explanation also works for the p-distance for p≠1; it just makes the math a little more cluttered.