It seems I didn’t articulate my point clearly. What I was saying is that V and V’ are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can’t be because of the complexity (since the complexity is equal), nor because we are maximising a proxy (because both have the same proxy).
So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It’s not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And the effect might not always hold (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about Goodhart problems less in some circumstances.
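A toy sketch of the mechanism being gestured at (this is my illustration, not anything from the thread): if the true value V is a (smooth-)min over several underlying variables, then maximising a proxy U that rewards only one of them collapses V, whereas if V is a plain sum over the same variables, proxy-maximisation does no damage. The allocation setup, budget, and variable count are all hypothetical.

```python
import math

def smooth_min(xs, beta=5.0):
    # Soft minimum via log-sum-exp; approaches min(xs) as beta grows.
    return -math.log(sum(math.exp(-beta * x) for x in xs) / len(xs)) / beta

# Hypothetical setup: value depends on 5 underlying variables, and the
# agent has a fixed effort budget. The proxy U rewards only variable 0.
budget, n = 5.0, 5
x_proxy_max = [budget] + [0.0] * (n - 1)   # pour all effort into the proxy
x_balanced = [budget / n] * n              # spread effort evenly

# Fragile V (a smooth-min of the variables) collapses under proxy
# maximisation: the neglected variables dominate the minimum.
fragile_proxy = smooth_min(x_proxy_max)
fragile_balanced = smooth_min(x_balanced)

# A linear V (a sum of the variables) is indifferent between the two
# allocations, so maximising the proxy causes no Goodhart-style damage.
linear_proxy, linear_balanced = sum(x_proxy_max), sum(x_balanced)

print(fragile_proxy, fragile_balanced, linear_proxy, linear_balanced)
```

The same proxy-maximising behaviour is catastrophic or harmless depending only on how V aggregates its underlying variables, which is one way to see why the shape of (our knowledge of) human values, rather than the proxy itself, does the work.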
I think it’s an empirical observation. Goodhart looked around, saw in many domains that U diverged from V in a bad way after it became a tracked metric, while seeing no examples of U diverging from a theoretical V’ in a good way, and then minted the “law.”
Upon further analysis, no-one has come up with a counterexample not already covered by the built-in exceptions: if U is sufficiently close to V, then maximizing U is fine (eg Moneyball); or if there is relatively little benefit to gaming the metric, agents won’t attempt to maximize U (eg anything using age as U, like senior discounts or school placements).
The world doesn’t just happen to behave in a certain way. The probability that all examples point in a single direction without some actual mechanism causing it is negligible.