I liked this post but wished it were short enough to hold entirely in my working memory. Partly because of the site formatting, and partly because I think it was written as an essay rather than a short reference post (which seems reasonable for the OP), I found it hard to scroll through without losing my train of thought.
I thought I’d try shortening it slightly to see if I could make it easier to parse. (I’ve also collated various examples people came up with.)
...
...
Goodhart Taxonomy
Goodhart’s Law states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them:
Regressional—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Model: When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
Example: height is correlated with basketball ability, and does directly help, but the best player is only 6′3″, and a random 7′ person in their 20s would probably not be as good.
Causal—When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
Example: an early 1900s college basketball team gets all of their players high-heeled shoes, because tallness causes people to be better at basketball. Instead, the players are slowed and get more foot injuries.
Extremal—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Example: the tallest person on record, Robert Wadlow, was 8′11″ (2.72m). He grew to that height because of a pituitary disorder, and he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet.”
Adversarial—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Model: Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
Example: aspiring NBA players might just lie about their height.
[note: I think most of the value of this came from the above list, but am curious if people find the rest of the post below easier to parse]
Regressional Goodhart
When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Abstract Model
When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
The above description applies when U is meant to be an estimate of V. A similar effect can be seen when U is only meant to be correlated with V, by looking at percentiles. When a sample is chosen which is a typical member of the top p percent of all U values, it will have a lower V value than a typical member of the top p percent of all V values. As a special case, when you select the highest U value, you will often not select the highest V value.
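A quick simulation sketch (my own, not from the OP’s post; the unit-variance normal distributions for V and X are arbitrary assumptions) makes this concrete: the option with the best proxy value predictably overshoots its true value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

V = rng.normal(size=n)   # true goal
X = rng.normal(size=n)   # independent noise
U = V + X                # proxy

# Select the option with the largest proxy value.
best = np.argmax(U)
print(f"selected: U = {U[best]:.2f}, true V = {V[best]:.2f}")

# With Var(V) = Var(X), E[V | U] = U / 2, so the options with the highest
# proxy values have true values only about half as large on average.
top = U >= np.quantile(U, 0.99)
print(f"top 1% by proxy: mean U = {U[top].mean():.2f}, mean V = {V[top].mean():.2f}")
```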
Examples
Regressional Goodhart happens every time someone does something that is anything other than precisely the thing that maximizes their goal.
Regression to the Mean, Winner’s Curse, Optimizer’s Curse, Tails Come Apart phenomenon.
Relationship with Other Goodhart Phenomena
Regressional Goodhart is by far the most benign of the four Goodhart effects. It is also the hardest to avoid, as it shows up every time the proxy and the goal are not exactly the same.
Mitigation
When facing only Regressional Goodhart, choose the option with the largest proxy value. It’ll still be an overestimate, but will be better in expectation than options with a smaller proxy value. If possible, choose proxies more tightly correlated with your goal.
If you are not just trying to pick the best option, but also trying to have an accurate picture of what the true value will be, Regressional Goodhart may cause you to overestimate the value. If you know the exact relationship between the proxy and the goal, you can account for this by just calculating the expected goal value for a given proxy value. If you have access to a second proxy with an error independent from the error in the first proxy, you can use the first proxy to optimize, and the second proxy to get an accurate expectation of the true value. (This is what happens when you set aside some training data to use for testing.)
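Continuing the same toy simulation (again my own sketch, with arbitrary distributions): optimizing on one proxy and evaluating on a second proxy with independent noise removes the predictable inflation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

V = rng.normal(size=n)        # true goal
U1 = V + rng.normal(size=n)   # proxy used for selection ("training")
U2 = V + rng.normal(size=n)   # proxy with independent noise, used only for evaluation ("testing")

best = np.argmax(U1)
print(f"selection proxy U1 = {U1[best]:.2f}  (predictably inflated)")
print(f"held-out proxy  U2 = {U2[best]:.2f}  (unbiased for V at this point)")
print(f"true value       V = {V[best]:.2f}")
```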
Causal Goodhart
When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Abstract Model
If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
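A toy structural sketch of this (mine, not the OP’s; the particular model where V drives U is an assumption chosen for illustration): conditioning on a large observed U picks out worlds with large V, but forcing U high through a side channel leaves V alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Structural model: V is upstream and causes U.
V = rng.normal(size=n)
U = V + rng.normal(size=n)

# Observation: among worlds where U happens to be large, V is also large.
print(f"mean V given observed U > 2: {V[U > 2].mean():.2f}")

# Intervention: set U high through a mechanism that bypasses V (do(U = 3)).
# Nothing feeds back into V, so V's distribution is unchanged.
U_intervened = np.full(n, 3.0)
print(f"mean V after do(U = 3):      {V.mean():.2f}")
```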
Examples
Humans often avoid naive Causal Goodhart errors, and most examples I can think of sound obnoxious (like eating caviar to become rich). A possible example is a human who avoids doctor visits because not being told about bad health is a proxy for being healthy. (I do not know enough about humans to know if Causal Goodhart is actually what is going on here.)
I also cannot think of a good AI example. Most AI is not acting in the kind of environment where Causal Goodhart would be a problem, and when it is acting in that kind of environment, Causal Goodhart errors are easily avoided.
Most of the time the phrase “Correlation does not imply causation” is used, it is pointing out that a proposed policy might be subject to Causal Goodhart.
Relationship with Other Goodhart Phenomena
You can tell the difference between Causal Goodhart and the other three types because Causal Goodhart goes away when you just sample a world with a large proxy value, rather than intervening to cause the proxy to happen.
Mitigation
One way to avoid Causal Goodhart is to only sample from or choose between worlds according to their proxy values, rather than causing the proxy. This clearly cannot be done in all situations, but it is useful to note that there is a class of problems for which Causal Goodhart cannot cause problems. For example, consider choosing between algorithms based on how well they do on some test inputs, when your goal is to choose an algorithm that performs well on random inputs. The fact that you choose an algorithm does not affect its performance, so you don’t have to worry about Causal Goodhart.
In cases where you actually change the proxy value, you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy.
Extremal Goodhart
Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Abstract Model
Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Examples
Humans evolved to like sugar because, in the ancestral environment (which had much less sugar), sugar was correlated with nutrition and survival. Humans now optimize for sugar, have way too much, and become less healthy.
As an abstract mathematical example, let U and V be two correlated dimensions in a multivariate normal distribution, but cut off the distribution to include only the ball of points in which U² + V² < n² for some large n. This represents a correlation between U and V in naturally occurring points, but also a boundary around what types of points are feasible, which need not respect that correlation. Imagine you sample k points and take the one with the largest U value. At first, increasing k lets this optimization pressure find better and better points for both U and V, but as k goes to infinity, you eventually sample so many points that you find a point near U = n, V = 0. When enough optimization pressure is applied, the correlation between U and V stops mattering, and instead the boundary of what kinds of points are possible at all decides what kind of point is selected.
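A rough simulation of that example (my own sketch; the 0.8 correlation, the radius n = 3, and the sample sizes are arbitrary choices): as k grows, the best-U point gets pushed against the feasibility boundary, and its V value stops improving even though U keeps climbing.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3.0                                 # radius of the feasible ball (arbitrary)
cov = [[1.0, 0.8], [0.8, 1.0]]          # U and V are correlated (arbitrary choice)

def sample_ball(k):
    """Draw ~k points (columns are U, V) from the correlated normal,
    truncated to the ball U^2 + V^2 < n^2."""
    pts = rng.multivariate_normal([0.0, 0.0], cov, size=2 * k)
    return pts[(pts ** 2).sum(axis=1) < n ** 2][:k]

for k in [100, 10_000, 1_000_000]:
    pts = sample_ball(k)
    u, v = pts[np.argmax(pts[:, 0])]    # take the point with the largest U
    print(f"k = {k:>9,}: best U = {u:5.2f}, its V = {v:5.2f}")
```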
Many examples of machine learning algorithms performing badly because of overfitting are a special case of Extremal Goodhart.
Relationship with Other Goodhart Phenomena
Extremal Goodhart differs from Regressional Goodhart in that Extremal Goodhart goes away in simple examples like correlated dimensions in a multivariate normal distribution, but Regressional Goodhart does not.
Mitigation
Quantilization and Regularization are both useful for mitigating Extremal Goodhart effects. In general, Extremal Goodhart can be mitigated by choosing an option with a high proxy value, but not so high as to take you to a domain drastically different from the one in which the proxy was learned.
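As a cartoon of the quantilization idea (my own sketch with a hypothetical quantilize helper, not an implementation from the quantilizers literature): instead of taking the argmax of the proxy, sample from the top q fraction of options, which keeps you closer to the distribution the proxy was validated on.

```python
import numpy as np

def quantilize(options, proxy, q=0.05, rng=None):
    """Hypothetical helper: return a random option from the top q fraction
    by proxy value, instead of the single proxy-maximizing option."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.array([proxy(o) for o in options])
    cutoff = np.quantile(scores, 1.0 - q)
    top = [o for o, s in zip(options, scores) if s >= cutoff]
    return top[rng.integers(len(top))]

# Usage with a dummy proxy; in practice the options come from some trusted
# base distribution and the proxy is the learned estimate of value.
rng = np.random.default_rng(4)
options = rng.normal(size=1_000)
print(quantilize(options, proxy=lambda x: x, q=0.05, rng=rng))
```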
Adversarial Goodhart
When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Abstract Model
Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
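A toy simulation of this model (mine; every number in it is arbitrary): an adversary who knows your proxy submits options crafted to score very high on U while the underlying resources go to W rather than V, so the proxy-maximizing option has low true value.

```python
import numpy as np

rng = np.random.default_rng(5)

# "Honest" options: the proxy U tracks the true goal V fairly well.
honest_V = rng.normal(size=1_000)
honest_U = honest_V + 0.1 * rng.normal(size=1_000)

# Adversarial options: engineered for very high U, with the underlying
# resources diverted to the adversary's goal W, so their true V is low.
adv_V = rng.normal(loc=-1.0, scale=0.3, size=20)
adv_U = rng.normal(loc=5.0, scale=0.3, size=20)

U = np.concatenate([honest_U, adv_U])
V = np.concatenate([honest_V, adv_V])

best = np.argmax(U)   # naively optimize the proxy
print(f"selected option: U = {U[best]:.2f}, true V = {V[best]:.2f}")
```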
Examples
When you use a metric to choose between people, but then those people learn what metric you use and game that metric, this is an example of Adversarial Goodhart.
Adversarial Goodhart is the mechanism behind a superintelligent AI making a Treacherous Turn. Here, V is doing what the humans want forever. U is doing what the humans want in the training cases where the AI does not have enough power to take over, and W is whatever the AI wants to do with the universe.
Adversarial Goodhart is also behind the malignancy of the universal prior, where you want to predict well forever (V), so hypotheses might predict well for a while (U), so that they can manipulate the world with their future predictions (W).
Relationship with Other Goodhart Phenomena
Adversarial Goodhart is the primary mechanism behind the original Goodhart’s Law.
Extremal Goodhart can happen even without any adversaries in the environment. However, Adversarial Goodhart may take advantage of Extremal Goodhart, as an adversary can more easily manipulate a small number of worlds with extreme proxy values than it can manipulate all of the worlds.
Mitigation
Successfully avoiding Adversarial Goodhart problems is very difficult in theory, and we understand very little about how to do this. In the case of non-superintelligent adversaries, you may be able to avoid Adversarial Goodhart by keeping your proxies secret (for example, not telling your employees what metrics you are using to evaluate them). However, this is unlikely to scale to dealing with superintelligent adversaries.
One technique that might help in mitigating Adversarial Goodhart is to choose a proxy so simple, and optimize it so hard, that adversaries have little or no control over the world which maximizes that proxy. (I want to emphasize that this is not a good plan for avoiding Adversarial Goodhart; it is just all I have.)
For example, say you have a complicated goal that includes wanting to go to Mars. If you use a complicated search process to find a plan that is likely to get you to Mars, adversaries in your search process may suggest a plan that involves building a superintelligence that gets you to Mars, but also kills you.
On the other hand, if you use the proxy of getting to Mars as fast as possible and optimize very hard, then (maybe) adversaries can’t add baggage to a proposed plan without being outselected by a plan without that baggage. Building a superintelligence maybe takes more time than just having the plan tell you how to build a rocket quickly. (Note that the plan will likely include things like acceleration that humans can’t handle and nanobots that don’t turn off, so Extremal Goodhart will still kill you.)
I am very happy you did this!
I added a Quick Reference Section which contains your outline. I suspect your other changes are good too, but I don’t want to copy them in without checking to make sure you didn’t change something important. (Maybe it would be good if you had some way to communicate the difference or the most important changes quickly.)
I also changed the causal basketball example.
On a meta note, I wonder how we can build a system of collaboration more directly into Less Wrong. I think this would be very useful. (I may be biased as someone who has an unusually high gap between ability to generate good ideas and ability to write.)
I actually didn’t make many other changes. (Originally I was planning to rewrite large chunks of it to reflect my own understanding; instead, the primary thing ended up being “what happens when I simply convert a post with 18px font into a comment with 13px font”.) I trimmed out a few words that seemed excessive, but this was more an exercise in “what if LW posts looked more like comments?” or something.
That said, if you think it’d be useful I’d be up for making another, more serious attempt to trim it down and/or make it more readable—this is something I could imagine turning out to be a valuable thing for me to spend time on regularly.