A key point (from here) about value fragility that I think this post importantly misses: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy u′ for a true utility function u, and u′ is always within ϵ of u (i.e. |u′−u|<ϵ). I maximize u′. Then the true utility u achieved will be within 2ϵ of the maximum achievable utility. Reasoning: in the worst case, u′ is ϵ lower than u at the u-maximizing point, and ϵ higher than u at the u′-maximizing point.
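Spelled out (writing x⋆ for a u-maximizer and x′ for a u′-maximizer; the names are introduced just for this derivation):

```latex
% Assume |u'(x) - u(x)| < \epsilon for all x,
% x_star maximizes u, and x' maximizes u'.
\begin{align*}
u(x') &\ge u'(x') - \epsilon      && \text{(proxy error at } x'\text{)} \\
      &\ge u'(x^\star) - \epsilon && (x' \text{ maximizes } u') \\
      &\ge u(x^\star) - 2\epsilon && \text{(proxy error at } x^\star\text{)}
\end{align*}
```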
The point: if a proxy is close to the true utility function everywhere, then maximizing the proxy will indeed achieve close-to-maximal utility. Goodhart problems require the proxy to fail to be even approximately close, in at least some places.
When we look at real-world Goodhart problems, they indeed involve an approximation that works well within some region but ceases to be even a good approximation once we move well outside that region. That’s a generalization problem, not an approximation problem.
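A minimal numeric sketch of that failure mode (my own toy construction, not an example from the post): the proxy matches u to within roughly 0.01 on a familiar region, but a tiny extra term dominates far outside it.

```python
# Toy sketch (hypothetical construction): a proxy that is an excellent
# approximation on a familiar region but generalizes badly outside it.
import numpy as np

def u(x):
    return -(x - 1.0) ** 2          # true utility, maximized at x = 1

def u_proxy(x):
    return u(x) + 1e-3 * x ** 3     # tiny error near [0, 2], dominant far away

# On the familiar region, the proxy is within epsilon = 0.008 of u ...
in_dist = np.linspace(0.0, 2.0, 201)
print("max |u' - u| on [0, 2]:", np.abs(u_proxy(in_dist) - u(in_dist)).max())

# ... but maximizing it over a much wider domain lands where the cubic term
# dominates and true utility is catastrophic.
wide = np.linspace(0.0, 2000.0, 20001)
x_best = wide[np.argmax(u_proxy(wide))]
print("proxy-optimal x:", x_best, "true utility there:", u(x_best))
# -> x = 2000, true utility about -4.0e6, versus the true optimum u(1) = 0
```

Note that the ϵ from the earlier bound is only 0.008 on [0, 2], so the 2ϵ guarantee holds for optimization restricted to that region; the disaster comes entirely from optimizing over points where the approximation no longer holds.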
So approximations are fine, so long as they generalize well.
Can you say more about how this insight undermines or otherwise changes the conclusions of the post?
I think Section 3 still mostly stands, but the arguments to get there change mildly. Section 4 changes a lot more: the distinction between “A’s values, according to A” vs “A’s values, according to B” becomes crucial—i.e. A may have a very different idea than B of what it means for A’s values to be satisfied in extreme out-of-distribution contexts. In the hard version of the problem, there isn’t any clear privileged notion of what “A’s values, according to A” would even mean far out-of-distribution.