It seems more principled, equally effective, and much more practical to simply take the policy that optimizes E[u] - (E[v] - v0)^2, where v0 is the expected value of v under some baseline “do nothing” policy. You can sum the penalty term over many different v’s to give a harsher requirement. I don’t know whether the machinery with counterfactuals etc. adds much beyond this.
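For concreteness, here is a minimal sketch of that objective, assuming a small finite set of candidate policies whose outcomes for u and for each v are available as samples. The names (`penalized_score`, `best_policy`, the toy policies and their numbers) are hypothetical and only illustrate the formula E[u] - sum_i (E[v_i] - v0_i)^2.

```python
from statistics import mean


def penalized_score(u_samples, v_samples_by_feature, v0_by_feature):
    """Estimate E[u] - sum_i (E[v_i] - v0_i)^2 from samples.

    v0_by_feature holds the baseline ("do nothing") expectation of each v_i.
    """
    penalty = sum(
        (mean(v_samples) - v0) ** 2
        for v_samples, v0 in zip(v_samples_by_feature, v0_by_feature)
    )
    return mean(u_samples) - penalty


def best_policy(policies, v0_by_feature):
    """Return the name of the policy with the highest penalized score."""
    return max(
        policies,
        key=lambda name: penalized_score(
            policies[name]["u"], policies[name]["v"], v0_by_feature
        ),
    )


# Toy data (made up for illustration): one tracked variable v, baseline v0 = 1.0.
policies = {
    "do_nothing": {"u": [0.0, 0.0], "v": [[1.0, 1.0]]},
    "aggressive": {"u": [5.0, 5.0], "v": [[4.0, 4.0]]},
    "careful": {"u": [3.0, 3.0], "v": [[1.5, 1.5]]},
}
v0 = [1.0]

print(best_policy(policies, v0))
```

In this toy setup the "aggressive" policy has the highest E[u] but moves E[v] far from the baseline, so the quadratic penalty outweighs its gain. Adding more entries to each policy's "v" list (and to `v0`) corresponds to summing over more v's.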
Yep, that seems sensible. (I assume you meant E[u] - (E[v] - v0)^2?)
Yes, fixed.