The reverse Goodhart problem
There are two aspects to the Goodhart problem which are often conflated. One is trivially true for all proxy-true utility pairs; but the other is not.
Following this terminology, we’ll say that V is the true goal, and U is the proxy. In the range of circumstances we’re used to, U≈V; that’s what makes U a good proxy. Then the Goodhart problem has two aspects to it:
1. Maximising U does not increase V as much as maximising V would.
2. When strongly maximising U, V starts to increase at a slower rate, and ultimately starts decreasing.
Aspect 1. is a tautology: the best way to maximise V is to… maximise V. Hence maximising U is almost certainly less effective at increasing V than maximising V directly.
But aspect 2. is not a tautology, and need not be true for generic proxy-true utility pairs (U,V). For instance, some pairs have the reverse Goodhart problem:
When strongly maximising U, V starts to increase at a faster rate, and ultimately starts increasing more than twice as fast as U.
Are there utility functions that have anti-Goodhart problems? Yes, many. If (U,V) has a Goodhart problem, then (U,V′) has an anti-Goodhart problem if V′=2U−V.
Then in the range of circumstances we’re used to, U=2U−U≈2U−V=V′. And, as V starts growing slower than U, V′ starts growing faster; when V starts decreasing, V′ starts growing more than twice as fast as U.
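The V′=2U−V construction can be checked numerically. The sketch below is only an illustration under assumed functional forms: it parametrises outcomes by a hypothetical "optimisation effort" x, takes U(x)=x as the proxy, and picks V(x)=x−x³/3 as a true utility that tracks U for small x and then declines. None of these specific functions come from the argument above; they are chosen just to exhibit a Goodhart pair.

```python
def U(x):
    # Hypothetical proxy: grows linearly with optimisation effort x.
    return x

def V(x):
    # Hypothetical true utility: close to U for small x, decreasing once x > 1.
    return x - x**3 / 3

def V_anti(x):
    # The constructed utility V' = 2U - V from the text.
    return 2 * U(x) - V(x)

def slope(f, x, h=1e-6):
    # Central finite-difference estimate of df/dx.
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.1, 0.5, 1.0, 1.5, 2.0):
    print(
        f"x={x:.1f}  dU={slope(U, x):.3f}  "
        f"dV={slope(V, x):.3f}  dV'={slope(V_anti, x):.3f}"
    )
```

Since dV′/dx = 2·dU/dx − dV/dx, the slope of V′ exceeds twice the slope of U exactly where the slope of V is negative, which is the relationship stated above.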
Are there more natural utility functions that have anti-Goodhart problems? Yes: for instance, when you’re a total or average utilitarian and you maximise the proxy “do the best for the worst off”. In general, if V is your true utility and U is a prioritarian/conservative version of V (eg U=−e^(−V) or U=log(V) or other concave, increasing functions) then we have reverse Goodhart behaviour[1].
So saying that we expect Goodhart problems (in the second sense) means that we know something special about V (and U). It’s not a generic problem for all utility functions, but for the ones we expect to correspond to human preferences.
[1] We also need to scale the proxy so that V≈U on the typical range of circumstances; thus the conservatism of U is only visible away from the typical range.
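As a rough illustration of the concave-proxy case, the sketch below uses U=1+log(V). The added constant is an assumption made here so that U≈V near a typical value of V=1 (since log(V)≈V−1 there); the typical value and the function names are hypothetical. Inverting gives V=e^(U−1), so the rate at which V grows per unit of U is V itself.

```python
import math

def U_from_V(v):
    # Concave proxy, shifted so that U ≈ V near the assumed typical value v = 1.
    return 1.0 + math.log(v)

def V_from_U(u):
    # Inverse of the proxy: the true utility reached at proxy level u.
    return math.exp(u - 1.0)

def dV_dU(u):
    # Derivative of exp(u - 1) with respect to u, which equals V itself.
    return V_from_U(u)

for u in (1.0, 1.5, 2.0, 3.0):
    print(f"U={u:.1f}  V={V_from_U(u):.3f}  dV/dU={dV_dU(u):.3f}")
```

In this toy setup dV/dU equals 1 at the typical point and passes 2 once V exceeds 2, so pushing the proxy further raises the true utility faster and faster.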