When is Goodhart catastrophic?
Thanks to Aryan Bhatt, Eric Neyman, and Vivek Hebbar for feedback.
This post gets more math-heavy over time; we convey some intuitions and overall takeaways first, and then get more detailed. Read for as long as you’re getting value out of things!
TLDR
How much should you optimize for a flawed measurement? If you model optimization as selecting for high values of your goal V plus an independent error X, then the answer ends up being very sensitive to the distribution of the error X: if it’s heavy-tailed you shouldn’t optimize too hard, but if it’s light-tailed you can go full speed ahead.
Related work
Why the tails come apart by Thrasymachus discusses a sort of “weak Goodhart” effect, where extremal proxy measurements won’t have extremal values of your goal (even if they’re still pretty good). It implicitly looks at cases similar to a normal distribution.
Scott Garrabrant’s taxonomy of Goodhart’s Law discusses several ways that the law can manifest. This post is about the “Regressional Goodhart” case.
Scaling Laws for Reward Model Overoptimization (Gao et al., 2022) considers very similar conditioning dynamics in real-world RLHF reward models. In their Appendix A, they show a special case of this phenomenon for light-tailed error, which we’ll prove a generalization of in the next post.
Defining and Characterizing Reward Hacking (Skalse et al., 2022) shows that under certain conditions, leaving any terms out of a reward function makes it possible to increase expected proxy return while decreasing expected true return.
How much do you believe your results? by Eric Neyman tackles very similar phenomena to the ones discussed here, particularly in section IV; in this post we’re interested in characterizing that sort of behavior and when it occurs. We strongly recommend reading it first if you’d like better intuitions behind some of the math presented here—though our post was written independently, it’s something of a sequel to Eric’s.
An Arbital page defines Goodhart’s Curse and notes
The exact conditions for Goodhart’s Curse applying between V and a point estimate or probability distribution over U [a proxy measure that an AI is optimizing], have not yet been written out in a convincing way.
To the extent this post adopts a reasonable frame, we think it makes progress towards this goal.
Motivation/intuition
Goodhart’s Law says
When a measure becomes a target, it ceases to be a good measure.
When I (Drake) first heard about Goodhart’s Law, I internalized something like “if you have a goal, and you optimize for a proxy that is less than perfectly correlated with the goal, hard enough optimization for the proxy won’t get you what you wanted.” This was a useful frame to have in my toolbox, but it wasn’t very detailed—I mostly had vague intuitions and some idealized fables from real life.
Much later, I saw some objections to this frame on Goodhart that actually used math.[1] The objection went something like:
Let’s try to sketch out an actual formal model here. What’s the simplest setup of “two correlated measurements”? We could have a joint normal distribution over two random variables, U and V, with zero mean and positive covariance. You actually value V, but you measure a proxy U. Then we can just do the math: if I optimize really hard for U, and give you a random datapoint with U = 10^12 or something, how much V do you expect to get?
If we look at the joint distribution of U and V, we’ll see a distribution with elliptical contour lines, like so:
Now, the naïve hope is that expected V as a function of observed U would go along the semi-major axis, shown in red below:
But actually we’ll get the blue line, passing through the points at which the ellipses have tangent lines parallel to the V-axis.[2]
Importantly, though, we’re still getting a line: we get linearly more value V for every additional unit of U we select for! Applying 99th percentile selection on U isn’t going to be as good as 99th percentile selection on V, but it’s still going to give us more V than any lower percentile selection on U.[3] The proxy is inefficient, but it’s not doomed.
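To make the objection’s math concrete, here is a minimal Monte Carlo sketch of our own (the 2:1 noise-to-value scale is an arbitrary illustrative choice): for jointly normal U and V, the conditional expectation of V given U is linear in U, with slope Cov(U,V)/Var(U).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000

V = rng.normal(size=n)                # the thing we actually value
U = V + 2.0 * rng.normal(size=n)      # proxy = value + independent Gaussian noise

slope = np.cov(U, V)[0, 1] / np.var(U)  # Cov(U,V)/Var(U); about 0.2 with this noise scale
for u0 in [1.0, 2.0, 4.0]:
    band = np.abs(U - u0) < 0.1         # crude stand-in for conditioning on U ≈ u0
    print(f"E[V | U ≈ {u0}] ≈ {V[band].mean():.3f}   (linear prediction {slope * u0:.3f})")
```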
Lately, however, I’ve come to think that this story is a little too rosy. One thing that’s going on here is that we’re just thinking about a “regressional Goodhart” problem, which is only one of several ways something Goodhart-like can come into play—see Scott Garrabrant’s taxonomy. But even in this setting, I think things can be much thornier.
In the story above, we can think of our measurement U as being some multiple of V plus an independent normally-distributed source of error, X. When we ask for an outcome with a really high value of U, we’re asking for a datapoint where X+V is very high.[4]
Because normal distributions drop off in probability very fast, it gets harder and harder to select for high values of either component: given that a datapoint is at least 4 standard deviations above the mean, the odds that it’s at least 5 standard deviations above are less than 1%. So the least-rare outcomes with high X+V are going to look like a compromise between the noise X and value V, where we have a medium amount of each piece (because going to the extremes for either one is disproportionately costly in terms of improbability).
To see this more visually, here are some plots of possible (X,V) pairs, restricted to the triangle of values where X+V≥6. Points are brighter if that outcome is more probable, and the black contour lines show regions of equal probability density. On the right, we have the expected value of V as a function of our proxy threshold t.
We can see that the most likely outcomes skew towards one side or the other depending on which of X and V has more variance, but because these contour lines are convex, we still expect to see outcomes that have some of each component.
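As a quick sanity check on this picture, here is a small simulation of our own with X and V both standard normal: harsher selection on the proxy keeps buying more V, roughly linearly in the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000
V = rng.normal(size=n)   # value
X = rng.normal(size=n)   # independent light-tailed error
U = X + V                # proxy

for t in [0, 1, 2, 3, 4]:
    sel = U >= t
    print(f"t={t}:  P(U >= t) = {sel.mean():.2e}   E[V | U >= t] = {V[sel].mean():.2f}")
```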
But now let’s look at a case where X and V are heavy-tailed, such that each additional unit of X or V requires fewer bits of optimization power.[5] Say that the probability density functions (PDFs) of X and V are proportional to exp(−√|x|), instead of exp(−c·x^2) like before.[6] Then we’ll see something more like
The resulting distribution is symmetric in X and V, of course, but unlike in the normal case, that doesn’t manifest as “X and V will be about the same”, but instead as “the outcome will be almost entirely X or almost entirely V, with even odds”.
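Here is a hedged sketch of the same experiment with these heavy-tailed variables (we sample a density proportional to exp(−√|x|) by squaring a Gamma(2) draw and attaching a random sign; that sampling trick is ours, not from the post). Conditional on a large sum, which variable is the big one is close to a coin flip, and the smaller component stays small even as the threshold grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000

def heavy(size):
    # density proportional to exp(-sqrt(|x|)): square a Gamma(2) draw, attach a random sign
    return rng.choice([-1.0, 1.0], size=size) * rng.gamma(2.0, size=size) ** 2

V = heavy(n)
X = heavy(n)
U = X + V

for t in [10, 30, 100]:
    sel = U >= t
    smaller = np.minimum(X[sel], V[sel])
    print(f"t={t:3d}:  P(V > X | U >= t) = {(V[sel] > X[sel]).mean():.2f}   "
          f"median of the smaller component = {np.median(smaller):.1f}")
```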
In this heavy-tailed regime, though, we care a lot about which of X or V has the edge here. For instance, suppose that optimizing a given amount for V only gets us half as far as it would for X (so e.g. the 99th percentile V value is half as large as the 99th percentile X value). Our plot now looks like
and in the limit for large t we won’t get any expected V at all by optimizing for the sum—all that optimization power goes towards producing high X values. We call this catastrophic Goodhart because the end result, in terms of V, is as bad as if we hadn’t conditioned at all.
(In general, if the right-hand tails of X and V are each on the order of e^(−x^c), we’ll switch between the two regimes right at c = 1: that’s when these contour lines switch from being convex to being concave.)
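Continuing the same sketch with the asymmetric setup above (halving V is just the illustrative “optimization buys half as much V” assumption), the conditional expectation of V rises at first and then begins to sink back toward E[V] = 0, while X absorbs most of the threshold; the full decline only shows up at much larger thresholds, which is what the limit results below are about.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000

def heavy(size):
    # density proportional to exp(-sqrt(|x|)), as in the previous sketch
    return rng.choice([-1.0, 1.0], size=size) * rng.gamma(2.0, size=size) ** 2

X = heavy(n)
V = heavy(n) / 2.0   # same tail shape, but a unit of selection buys half as much V
U = X + V

for t in [3, 10, 30, 100]:
    sel = U >= t
    print(f"t={t:3d}:  E[V | U >= t] = {V[sel].mean():5.2f}   E[X | U >= t] = {X[sel].mean():6.1f}")
```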
To help visualize this behavior, let’s zoom in closer on a concrete example where we get catastrophic Goodhart.[7] See below for plots of the PDFs of V and X:
On the left is a standard plot of the two PDFs; on the right is a plot of their negative logarithms. The right-hand plot makes it apparent that X has heavier right tails, because the green line gets arbitrarily far below the orange line in the limit.
Here is a GIF of the conditional distribution on V as t goes from −5 up to 150, with a dashed blue line indicating the conditional expectation:
Note the spike in the conditional PDF around t, corresponding to outcomes where X is small and V is large; because of the heavier tails on X, this spike gets smaller and smaller with larger t. (We recommend staring at this GIF until you feel like you have a good understanding of why it looks the way it does.)
The expected value initially goes up when we apply a little selection pressure to our proxy, but as we optimize harder, that optimization pressure gets shunted more and more into optimization for X, and less and less for V, even in absolute terms. (This is the same dynamic that Eric Neyman recently discussed in section IV of How much do you believe your results?, put in a slightly different framing.)
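For readers who want to reproduce the curve behind the GIF, here is a rough numerical-integration sketch of the footnote-[7] example (using scipy quadrature; the integration bounds and breakpoints are our own choices for numerical stability). The conditional expectation rises under mild selection and then falls back toward E[V] = 0 as the threshold grows.

```python
import numpy as np
from scipy import integrate

# The footnote-[7] example: f_X(x) ∝ exp(-2·sqrt(|x|)), f_V(v) ∝ exp(-(v·S(v))^0.8),
# with S(v) = (e^v - 1)/(e^v + 1), which equals tanh(v/2).
def f_V_unnorm(v):
    return np.exp(-(v * np.tanh(v / 2)) ** 0.8)

def tail_X(a):
    # P(X >= a); for a >= 0, integrating exp(-2*sqrt(x)) gives (sqrt(a) + 1/2)*exp(-2*sqrt(a)),
    # and exp(-2*sqrt(|x|)) already integrates to 1 over the real line.
    if a < 0:
        return 1.0 - tail_X(-a)
    return (np.sqrt(a) + 0.5) * np.exp(-2.0 * np.sqrt(a))

def E_V_given_sum_at_least(t):
    # Weight v by P(X >= t - v); dividing by P(X >= t) keeps the integrals at a friendly scale.
    w = lambda v: f_V_unnorm(v) * tail_X(t - v) / tail_X(t)
    num = integrate.quad(lambda v: v * w(v), -50.0, t + 50.0, points=[0.0, t], limit=500)[0]
    den = integrate.quad(w, -50.0, t + 50.0, points=[0.0, t], limit=500)[0]
    return num / den

for t in [2.0, 10.0, 40.0, 150.0]:
    print(f"t = {t:5.0f}:   E[V | X+V >= t] ≈ {E_V_given_sum_at_least(t):.2f}")
```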
In the next post, we’re going to prove some results about when this effect happens; this will be pretty technical, so we’ll talk a bit about the results in broad strokes here.
Proof statement
Suppose that X and V are independent real-valued random variables. We’ll show, roughly, that if
X is subexponential (a property slightly stronger than being heavy-tailed), and
V has lighter tails than X by more than a linear factor, meaning that the ratio of X’s tail to V’s tail, P(X > t)/P(V > t), grows superlinearly,[8]
then lim_{t→∞} E[V | X+V ≥ t] = E[V].
Less formally, we’re saying something like “if it requires relatively little selection pressure to get more of X but asymptotically more selection pressure to get more of V, then applying very strong optimization towards X+V will not get you even a little bit of optimization towards V; all the optimization power will go towards X, where it has the best return on investment.”
We’ll also show a sort of inverse to this: if X has right tails that are lighter than exponential (for instance, if X is normal or bounded), then we’ll get arbitrarily much V in the limit, no matter what kind of tail distribution V has.
(What if X is heavy-tailed but V has even heavier tails than X? Then we can exchange their places in the first theorem, and conclude that in the limit we get no extra X at all—which means that all of that optimization is going towards V.)
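A quick empirical illustration of both directions, with Pareto and normal distributions standing in for “subexponential” and “light-tailed” (our choices, not the ones used in the proofs): heavy-tailed noise drives E[V | X+V ≥ t] back toward E[V], while light-tailed noise lets it grow without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000

# Case 1: subexponential noise, light-tailed value (the theorem's setting).
X_heavy = rng.pareto(2.0, size=n)       # P(X > x) = (1+x)^(-2): heavy-tailed, subexponential
V_light = rng.normal(size=n)            # E[V] = 0

# Case 2: light-tailed noise, heavy-tailed value (the inverse result).
X_light = rng.normal(size=n)
V_heavy = rng.pareto(2.0, size=n)

for t in [5, 20, 80]:
    a = V_light[X_heavy + V_light >= t].mean()
    b = V_heavy[X_light + V_heavy >= t].mean()
    print(f"t={t:2d}:  heavy-tailed noise: E[V|U>=t] = {a:5.2f}    light-tailed noise: E[V|U>=t] = {b:6.1f}")
```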
In the next post, we’ll prove these claims.
Application to alignment
We might want to use unaligned AI to generate alignment research for us. One model for this is sampling a random document from the space of 10000-bit strings, then conditioning on a high human rating. If evaluation of alignment proposals is substantially easier than generating good alignment proposals, these plans will be useful. If not, we’ll have a hard time getting research out of the AI. This is a crux between John Wentworth and Paul Christiano + Jan Leike that informs their differing approaches to alignment.
We can frame the problem of evaluation in terms of Goodhart’s Law. Let V be the true quality of an alignment plan (say in utility contributed to the future), and U=X+V be the human rating, so that X is the human’s rating error. If V and X are independent, and we have access to arbitrarily strong optimization for U, then our result implies that to implement an alignment plan better than random…
… if V is light-tailed, X must not be heavy-tailed.
… if V is heavy-tailed, X must not be much heavier-tailed than V.
We don’t know whether V is heavy- or light-tailed in real life, so to be safe, we should make X light-tailed. To the extent this model is accurate, a large part of alignment reduces to the problem of finding a classifier with light-tailed errors, which is able to operate in the exceptionally complicated domain of evaluating plans, and is not itself dangerous.
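As a toy version of this setup, here is a best-of-n selection sketch (our own construction: normal “true quality”, a Cauchy stand-in for heavy-tailed rater error, and best-of-n as a crude proxy for strong optimization). With light-tailed errors, more candidates keep buying more true quality; with heavy-tailed errors, the top-rated plan is about as good as a random one.

```python
import numpy as np

rng = np.random.default_rng(0)

def selected_quality(n_plans, noise, reps=100):
    # Average true quality of the top-rated plan among n_plans candidates.
    picks = []
    for _ in range(reps):
        V = rng.normal(size=n_plans)       # true quality of each candidate plan
        X = noise(n_plans)                 # rating error
        picks.append(V[np.argmax(V + X)])  # choose the plan with the highest rating
    return float(np.mean(picks))

light = lambda k: rng.normal(size=k)            # light-tailed rating errors
heavy = lambda k: rng.standard_cauchy(size=k)   # very heavy-tailed rating errors

for n_plans in [100, 10_000, 1_000_000]:
    print(f"{n_plans:>9,} candidates:  light-tailed errors {selected_quality(n_plans, light):.2f}"
          f"   heavy-tailed errors {selected_quality(n_plans, heavy):.2f}")
```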
This model makes two really strong assumptions: that optimization is like conditioning, and that X and V are independent. These are violated in real life:
Optimization is not simply conditioning; SGD has too many inductive biases for us to list here, and Gao et al. (2022) found that for a given level of optimization, RL uses far more KL distance from the prior than best-of-n sampling.
X and V will not be independent. Among other reasons, we expect that more complicated or optimized plans are more likely to have large impacts on the world (thus having higher variance of V), and harder to evaluate (thus having higher variance of X). However, in some cases, really good plans might be easier to evaluate; for example, formalized proofs can be efficiently checked.
There’s also a sort of implicit assumption in even using a framing that thinks about things as X+V; the world might be better thought of as naturally containing (U,V) tuples (with U our proxy measurement), and X=U−V could be a sort of unnatural construction that doesn’t make sense to single out in the real world. (We do think this framing is relatively natural, but won’t get into justifications here.)
Despite these caveats, some takeaways we endorse:
Optimization for imperfect proxies is sometimes fine and sometimes doomed, depending on your distribution.
Goodhart’s law is subtle—even within a given framing of a problem, what happens when you optimize can be very sensitive to the exact numerical details of your measurements.
In particular, reaching for a normally-distributed toy model by default can be super misleading for thinking about a lot of real-world dynamics, because normal tails are much lighter than those of most real-world quantities, in a way that changes the qualitative takeaways.
In an alignment plan involving generation and evaluation, you should either (a) have reason to believe that your classifier’s errors are light-tailed, (b) have a reason why training an AI on human (or AI) feedback will be importantly different from conditioning on high feedback scores, or (c) have a story for why non-independence works in your favor.
Exercises
Show that when X and V are independent and t ∈ ℝ, E[V | X+V > t] ≥ E[V]. Conclude that lim_{t→∞} E[V | X+V > t] ≥ E[V]. This means that given independence, optimization always produces a plan that is no worse than random.
When independence is violated, an optimized plan can be worse than random, even if your evaluator is unbiased. Construct a joint distribution f_{V,X} for X and V such that E[X] = 0, E[V] = 0, and E[X | V=v] = 0 for any v ∈ ℝ, but lim_{t→∞} E[V | X+V > t] = −∞.
Answers to exercises are at the end of the next post.
Thanks to Eric Neyman for first making this observation clear to me.
One way to see this intuitively is to consider the shear transformation replacing V by V−cU, where c is a constant such that the resulting random variable is uncorrelated with U. In that situation we’d have a constant expectation of 0, so adding the U component back in should give us a linear expectation.
To be precise, E[V | U=t] = t ⋅ Cov(U,V) / Var(U).
Technically we could have U=cV+X, but we can just rescale U until the V coefficient is 1 without changing anything.
Most heavy-tailed distributions are also long-tailed, which means that lim_{x→∞} Pr(X > x+t) / Pr(X > x) = 1 for all t > 0. So the optimization needed to get from the event “X is at least x” to “X is at least x+t” becomes arbitrarily small for large x.
Note that this effect doesn’t depend on the behavior of X or V right around zero, just on their right tails.
We’ll suppose that X has a PDF proportional to e^(−2√|x|) and V has a PDF proportional to e^(−(v⋅S(v))^0.8), where S(v) = (e^v − 1)/(e^v + 1) is an odd function that quickly asymptotes to sign(v), so V has tails like e^(−|v|^0.8) for large |v| in either direction but is smooth around v = 0.
We’ll use something slightly stronger than this; we’d like X’s tails to be larger by a factor of t^(1+ϵ). More precise details in the next post.