All I know is Goodhart
I’ve done some work on Goodhart’s law, and I’ve argued that we can make use of all our known uncertainties in order to reduce or remove this effect.
Here I’ll look at a very simple case: where we know only one thing, which is that Goodhart’s law exists.
Knowing about Goodhart’s law
Proxies exist
There are two versions of Goodhart’s law, as we commonly use the term. The simplest is that there is a difference between maximising a proxy -- V -- and maximising the real objective -- U.
Let W=U−V be the difference between the true objective and the proxy. Note that in this post, we’re seeing U, V, and W as actual maps from world histories to R. The equivalence classes of them under positive affine transformations are denoted by [U], [V], [W] and so on.
Note that U=V+W makes sense, as does [U]=[V+W], but [U]=[V]+[W] does not: [V]+[W] defines a family of functions with three degrees of freedom (two scalings and one addition), not the usual two for [U] (one scaling and one addition).
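To make the degrees-of-freedom point concrete, here is a minimal sketch of my own (utilities represented as value-vectors over three world-histories, all numbers invented): a member of [V+W] gets one scaling and one shift, while [V]+[W] lets V and W be scaled independently before adding, producing a strictly larger family.

```python
# Toy illustration (my own construction, not from the post): utilities
# as vectors of values over three world-histories. The positive affine
# class [U] is {a*U + c : a > 0}.
V = [1.0, 0.0, 0.0]
W = [0.0, 1.0, 0.0]

def affine(u, a, c):
    """Apply a positive affine transformation a*u + c pointwise."""
    return [a * x + c for x in u]

# A member of [V + W]: one scaling, one shift -> two degrees of freedom.
member_of_sum_class = affine([v + w for v, w in zip(V, W)], a=2.0, c=1.0)

# A "member" of [V] + [W]: V and W get *independent* scalings before
# the addition -> three degrees of freedom, a strictly bigger family.
a1, a2, c = 2.0, 5.0, 1.0
member_of_class_sum = [a1 * v + a2 * w + c for v, w in zip(V, W)]

print(member_of_sum_class)  # [3.0, 3.0, 1.0]
print(member_of_class_sum)  # [3.0, 6.0, 1.0]
```

Note that [3.0, 6.0, 1.0] cannot lie in [V+W]: every member of that class has equal values on the first two histories, since V+W does.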
The simplest version of Goodhart’s law is thus that there is a chance for W to be non-zero.
Let W be the vector space of possible W, and let p be a probability distribution over it. Assume further that p is symmetric—that for all W∈W, p(W)=p(−W).
Under these assumptions, the pernicious effect of Goodhart’s law appears in full force: suppose we ask an agent to maximise U=V+W, with V known and p giving the distribution over possible W.
Then, given that uncertainty, it will choose the policy π that maximises
E(U∣π) = E(V∣π) + ∑W∈W p(W)E(W∣π) = E(V∣π) + 0,
since W and −W cancel out.
So the agent will blindly maximise the proxy V.
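The cancellation argument above can be sketched numerically. The policies, proxy values, and candidate W’s below are all invented for illustration; the point is only that a symmetric prior over W contributes nothing to expected U, so the agent ranks policies by V alone.

```python
# Toy sketch (my own numbers, not from the post): two candidate
# policies, a known proxy V, and a symmetric prior over the unknown W.
E_V = {"default": 1.0, "maximise_proxy": 10.0}

# Symmetric prior: every candidate E(W|.) profile appears alongside its
# negation with equal probability, so p(W) = p(-W).
candidate_Ws = [
    {"default": 5.0, "maximise_proxy": -20.0},  # proxy-gaming is costly...
    {"default": -5.0, "maximise_proxy": 20.0},  # ...or equally likely helpful
]
p = [0.5, 0.5]

def expected_U(policy):
    # E(U|pi) = E(V|pi) + sum_W p(W) * E(W|pi)
    return E_V[policy] + sum(pw * W[policy] for pw, W in zip(p, candidate_Ws))

for policy in E_V:
    print(policy, expected_U(policy))
# The W-terms cancel, so E(U|pi) = E(V|pi) and the agent picks the
# proxy-maximising policy.
```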
We know: maximising behaviour is bad
But, most of the time, when we talk about Goodhart’s law, we don’t just mean “a proxy exists”. We mean that not only does a proxy exist, but that maximising the proxy too much is pernicious for the true utility.
Consider for example a nail factory, where U is the number of true nails produced, and V is the number of “straight pieces of metal” produced. Here W=U−V is the difference between the number of true nails and the pieces of metal.
In this case, we expect a powerful agent maximising V to do much, much worse on the U scale. As the agent expands and gets more control over the world’s metal production, V continues to climb, while U tumbles. So V is not only a (bad) proxy for U; at the extremes, it’s pernicious.
But what if we considered U′=V−W? This is an odd utility indeed; it’s equal to V−(U−V)=2V−U. This is twice the number of pieces of metal produced, minus the number of true nails produced. And as the agent’s power increases, so does V, but so does U′, to an even greater extent. Now, U′ won’t increase to the same extent as it would under the optimal policy for U′, but it still increases massively under V-optimisation.
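As a rough numerical illustration of this asymmetry (the factory figures are entirely invented): as V-optimisation intensifies, U collapses while U′ = 2V − U grows even faster than V does.

```python
# Hypothetical nail-factory numbers (illustrative only): as the
# V-maximiser gains power, straight-metal output V climbs while true
# nail output U tumbles. U' = 2V - U then climbs even faster than V.
scenarios = [
    # (description, U = true nails, V = straight metal pieces)
    ("default policy",         1_000,     1_100),
    ("mild V-optimisation",      500,    10_000),
    ("strong V-optimisation",     10, 1_000_000),
]
for name, U, V in scenarios:
    U_prime = 2 * V - U  # the "odd" utility U' = V - W = 2V - U
    print(f"{name}: U={U}, V={V}, U'={U_prime}")
```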
And, implicitly, when we talk about Goodhart’s law, we generally mean that the true utility is of type U, rather than of type U′; indeed, that things like U′ don’t really make sense as candidates for the true utility. So there is a break in symmetry between W and −W.
Putting this knowledge into figures
So, suppose π0 is some default policy, and π∗V is the V-maximising policy. One way of phrasing the stronger type of Goodhart’s law is that:
E(U∣π0)>E(U∣π∗V).
Then substituting U=V+W and rearranging gives:
E(W∣π0)−E(W∣π∗V)>E(V∣π∗V)−E(V∣π0)=C.
Because π∗V is the optimal policy for V, the term C is non-negative (and most likely strictly positive). So the new restriction on W is that:
E(W∣π0)−E(W∣π∗V)>C≥0.
This is an affine restriction, in that if W and W′ both satisfy it, then so does any mix qW+(1−q)W′ for 0≤q≤1. In fact, since the expectations are linear in W, the restriction cuts W into two halves separated by a hyperplane: everything on one side satisfies the restriction, and everything on the other side does not.
In fact, the set that satisfies the restriction (call it W+) is “smaller” (under p) than the set that does not (call that W−). This is because, if W∈W+, then −W∈W− -- but the converse is not true if C>0.
And now, when an agent maximises U=V+W, with V known and W distributed by p but also known to obey that restriction, the picture is very different. It will maximise V+W′, where W′ = (∑W∈W+ p(W)W)/p(W+) is the conditional expectation of W given the restriction, and this is far from 0 in general. So the agent won’t just maximise the proxy.
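This update can be sketched by reusing the toy numbers from before (all invented): conditioning the symmetric prior on the restriction E(W∣π0)−E(W∣π∗V)>C throws out the “proxy-gaming is secretly fine” half, shifts the posterior mean W′ away from zero, and flips the agent’s choice.

```python
# Toy sketch (my own numbers): update a symmetric prior over W on the
# Goodhart restriction E(W|pi_0) - E(W|pi*_V) > C, then re-optimise.
policies = ["default", "maximise_proxy"]  # pi_0 and pi*_V
E_V = {"default": 1.0, "maximise_proxy": 10.0}
C = E_V["maximise_proxy"] - E_V["default"]  # C = 9 >= 0

# Symmetric prior: each candidate E(W|.) profile paired with its negation.
candidates = [
    {"default": 5.0, "maximise_proxy": -20.0},
    {"default": -5.0, "maximise_proxy": 20.0},
]
prior = [0.5, 0.5]

# Keep only the W satisfying the restriction, and renormalise.
kept = [(w, pw) for w, pw in zip(candidates, prior)
        if w["default"] - w["maximise_proxy"] > C]
total = sum(pw for _, pw in kept)
posterior = [(w, pw / total) for w, pw in kept]

# Posterior mean W' per policy; the agent now maximises V + W'.
W_prime = {pi: sum(pw * w[pi] for w, pw in posterior) for pi in policies}
scores = {pi: E_V[pi] + W_prime[pi] for pi in policies}
print(scores)
# Only the first candidate survives (5 - (-20) = 25 > 9), so W' is far
# from zero: the default policy scores 6 against -10 for the proxy
# maximiser, and the agent no longer blindly maximises V.
```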
Conclusion
So even the seemingly trivial fact that we expect a particular type of Goodhart effect dramatically reduces the pernicious impact of Goodhart’s law.
Now, the effect isn’t enough to converge on a good U: we’ll need to use other information for that. But note one interesting point: the more powerful the agent is, the more effective it is at maximising V, so the higher E(V∣π∗V) gets—and thus the higher C becomes. So the most powerful agents have the strongest restrictions on what the possible U’s are. Note that we might be able to get this effect even for more limited agents, by defining π∗V not only as the optimal policy, but as some miraculous optimal policy where things work out unexpectedly well for the agent.
It will be interesting to see what happens in situations where we account for more and more of our (implicit and explicit) knowledge about U.