If we make the (false) assumption that we both have utility/reward functions, and write E_U(V) for the expected value of utility V when a U-maximiser is choosing the actions, then we can measure the distance from utility U to utility V as d(U,V) = E_U(U) - E_V(U).
This is non-symmetric and doesn't obey the triangle inequality, but it is a very natural measure: it represents the cost to U of replacing a U-maximiser with a V-maximiser.
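
As a rough illustration of this distance, here is a minimal sketch in a toy setting that is entirely an assumption of the example: three deterministic actions, with each utility given as a table over those actions, so that "expected utility" reduces to the value of the chosen action. The names `best_action`, `expected` and `d`, and the numbers in U, V and W, are hypothetical and chosen only to exhibit the asymmetry and the failure of the triangle inequality.

```python
# Toy utilities over three deterministic actions (illustrative numbers only).
U = {"a": 10, "b": 9, "c": 0}
V = {"a": 0, "b": 10, "c": 9}
W = {"a": 0, "b": 0, "c": 10}


def best_action(utility):
    """Action a maximiser of `utility` would pick (no ties in this toy data)."""
    return max(utility, key=utility.get)


def expected(maximised, evaluated):
    """E_maximised(evaluated): value of `evaluated` under a `maximised`-maximiser."""
    return evaluated[best_action(maximised)]


def d(u, v):
    """d(U,V) = E_U(U) - E_V(U): cost to U of swapping its maximiser for a V-maximiser."""
    return expected(u, u) - expected(v, u)


print("d(U,V) =", d(U, V))
print("d(V,U) =", d(V, U))
print("d(U,W) =", d(U, W), "vs d(U,V) + d(V,W) =", d(U, V) + d(V, W))
```

In this toy case d(U,V) and d(V,U) differ, and d(U,W) exceeds d(U,V) + d(V,W): the V-maximiser's choice is almost as good for U as U's own, and the W-maximiser's choice is almost as good for V, yet the W-maximiser's choice is the worst one for U.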