Goodhart Taxonomy
Goodhart’s Law states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them.
The four types are Regressional, Causal, Extremal, and Adversarial. In this post, I will go into detail about these four different Goodhart effects using mathematical abstractions as well as examples involving humans and/or AI. I will also talk about how you can mitigate each effect.
Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way.
Quick Reference
Regressional Goodhart—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Model: When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6′3″, and a random 7′ person in their 20s would probably not be as good
Causal Goodhart—When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
Example: someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball.
Extremal Goodhart—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Example: the tallest person on record, Robert Wadlow, was 8′11″ (2.72m). He grew to that height because of a pituitary disorder, and he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet.”
Adversarial Goodhart—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Model: Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
Example: aspiring NBA players might just lie about their height.
Regressional Goodhart
When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Abstract Model
When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
The above description is when U is meant to be an estimate of V. A similar effect can be seen when U is only meant to be correlated with V by looking at percentiles. When a sample is chosen which is a typical member of the top p percent of all U values, it will have a lower V value than a typical member of the top p percent of all V values. As a special case, when you select the highest U value, you will often not select the highest V value.
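To make this concrete, here is a minimal simulation sketch in Python (my own illustration, not from the original post), assuming standard-normal V and X and the proxy U = V + X; the sample size and percentile cutoff are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (my own): 10,000 points, V and the noise X both
# standard normal, proxy U = V + X.
num_points = 10_000
V = rng.normal(size=num_points)
X = rng.normal(size=num_points)
U = V + X

best = int(np.argmax(U))
print(f"Point with largest U: U = {U[best]:.2f}, V = {V[best]:.2f}")
# U at the selected point is predictably larger than V there, because the
# argmax of U selects for large X as well as large V.

# Percentile version: the top 1% by U has a lower average V than the top 1% by V.
top_by_U = V[U >= np.quantile(U, 0.99)].mean()
top_by_V = V[V >= np.quantile(V, 0.99)].mean()
print(f"Mean V in top 1% by U: {top_by_U:.2f}; mean V in top 1% by V: {top_by_V:.2f}")
```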
Examples
Examples of Regressional Goodhart are everywhere. Every time someone does something that is anything other than the thing that maximizes their goal, you could view it as them optimizing some kind of proxy (and the action to maximize the proxy is not the same as the action to maximize the goal).
Regression to the Mean, Winner’s Curse, and Optimizer’s Curse are all examples of Regressional Goodhart, as is the Tails Come Apart phenomenon.
Relationship with Other Goodhart Phenomena
Regressional Goodhart is by far the most benign of the four Goodhart effects. It is also the hardest to avoid, as it shows up every time the proxy and the goal are not exactly the same.
Mitigation
When facing only Regressional Goodhart, you still want to choose the option with the largest proxy value. While the proxy will be an overestimate of the true value, that option will still be better in expectation than options with smaller proxy values. If you have control over what proxies to use, you can mitigate Regressional Goodhart by choosing proxies that are more tightly correlated with your goal.
If you are not just trying to pick the best option, but also trying to have an accurate picture of what the true value will be, Regressional Goodhart may cause you to overestimate the value. If you know the exact relationship between the proxy and the goal, you can account for this by just calculating the expected goal value for a given proxy value. If you have access to a second proxy with an error independent from the error in the first proxy, you can use the first proxy to optimize, and the second proxy to get an accurate expectation of the true value. (This is what happens when you set aside some training data to use for testing.)
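As a rough sketch of the second-proxy idea (my own construction; the two proxies and their noise levels are assumed for illustration), the following Python snippet optimizes on one proxy and evaluates on a second proxy with independent noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (my own): V is the true value; U1 and U2 are two proxies
# of V with independent noise, analogous to a train/test split.
num_points = 10_000
V = rng.normal(size=num_points)
U1 = V + rng.normal(size=num_points)   # proxy used to optimize
U2 = V + rng.normal(size=num_points)   # held-out proxy used only to evaluate

best = int(np.argmax(U1))
print(f"Optimized proxy U1 = {U1[best]:.2f}  (systematic overestimate of V)")
print(f"Held-out proxy  U2 = {U2[best]:.2f}  (unbiased estimate of V at this point)")
print(f"True value       V = {V[best]:.2f}")
```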
Causal Goodhart
When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Abstract Model
If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
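A minimal sketch of this, assuming a toy common-cause structure of my own choosing (none of the variables or coefficients come from the post): conditioning on a large observed U predicts a large V, but setting U by fiat leaves V unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical structural model: a common cause C drives both the goal V
# and the proxy U, so U and V are correlated observationally.
C = rng.normal(size=n)
V = C + 0.1 * rng.normal(size=n)
U = C + 0.1 * rng.normal(size=n)

# Conditioning: worlds observed to have a large U also tend to have a large V.
print(f"E[V | U > 2]     = {V[U > 2].mean():.2f}")

# Intervening: forcing U to be large (do(U = 3)) bypasses C, so V is unchanged.
V_after_intervention = V            # setting U has no causal effect on V here
print(f"E[V | do(U = 3)] = {V_after_intervention.mean():.2f}")
```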
Examples
Humans often avoid naive Causal Goodhart errors, and most examples I can think of sound obnoxious (like eating caviar to become rich). One possible example is a human who avoids doctor visits because not being told about bad health is a proxy for being healthy. (I do not know enough about humans to know if Causal Goodhart is actually what is going on here.)
I also cannot think of a good AI example. Most AI is not acting in the kind of environment where Causal Goodhart would be a problem, and when it is acting in that kind of environment, Causal Goodhart errors are easily avoided.
Most of the time the phrase “Correlation does not imply causation” is used, it is pointing out that a proposed policy might be subject to Causal Goodhart.
Relationship with Other Goodhart Phenomena
You can tell the difference between Causal Goodhart and the other three types because Causal Goodhart goes away when you just sample a world with a large proxy value, rather than intervene to make the proxy large.
Mitigation
One way to avoid Causal Goodhart is to only sample from or choose between worlds according to their proxy values, rather than intervening to cause the proxy. This clearly cannot be done in all situations, but it is useful to note that there is a class of problems for which Causal Goodhart cannot cause problems. For example, consider choosing between algorithms based on how well they do on some test inputs, where your goal is to choose an algorithm that performs well on random inputs. The fact that you choose an algorithm does not affect its performance, so you don’t have to worry about Causal Goodhart.
In cases where you actually change the proxy value, you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy.
Extremal Goodhart
Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Abstract Model
Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Examples
Humans evolved to like sugar because sugar was correlated with nutrition and survival in the ancestral environment (which had less sugar). Humans then optimized for sugar, got far too much of it, and became less healthy.
As an abstract mathematical example, let U and V be two correlated dimensions in a multivariate normal distribution, but we cut off the distribution to only include the ball of points in which U² + V² < n² for some large n. This example represents a correlation between U and V in naturally occurring points, but also a boundary around what types of points are feasible that need not respect this correlation. Imagine you were to sample k points and take the one with the largest U value. As you increase k, at first this optimization pressure lets you find better and better points for both U and V, but as you increase k to infinity, eventually you sample so many points that you will find a point near U = n, V = 0. When enough optimization pressure was applied, the correlation between U and V stopped mattering, and instead the boundary of what kinds of points were possible at all decided what kind of point was selected.
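Here is a rough Python sketch of that example. The correlation, the ball radius, and the sample sizes are my own illustrative choices (a small radius is used so the boundary effect shows up at modest k); the post itself only specifies the qualitative setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the post): correlation 0.8 and a ball of
# radius n = 3, chosen small so the boundary matters at modest sample sizes.
n = 3.0
cov = [[1.0, 0.8], [0.8, 1.0]]

def best_of_k(k: int):
    """Sample k points from the truncated distribution and return the
    U and V coordinates of the one with the largest U value."""
    pts = rng.multivariate_normal([0.0, 0.0], cov, size=2 * k)
    pts = pts[(pts ** 2).sum(axis=1) < n ** 2][:k]   # keep points inside the ball
    u, v = pts[np.argmax(pts[:, 0])]
    return u, v

for k in (10, 1_000, 100_000):
    u, v = best_of_k(k)
    print(f"k = {k:>7}: best U = {u:.2f}, its V = {v:.2f}")
# Weak optimization (small k) finds points good on both U and V; strong
# optimization (large k) lands near the boundary point (n, 0), where V is small.
```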
Many examples of machine learning algorithms performing badly because of overfitting are a special case of Extremal Goodhart.
Relationship with Other Goodhart Phenomena
Extremal Goodhart differs from Regressional Goodhart in that Extremal Goodhart goes away in simple examples like correlated dimensions in a multivariate normal distribution, but Regressional Goodhart does not.
Mitigation
Quantilization and Regularization are both useful for mitigating Extremal Goodhart effects. In general, Extremal Goodhart can be mitigated by choosing an option with a high proxy value, but not so high as to take you to a domain drastically different from the one in which the proxy was learned.
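As a sketch of the quantilization idea (my own minimal version with an arbitrary top fraction q, not the full formalism): instead of taking the argmax of the proxy, pick randomly among the top q fraction of options by proxy value.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(proxy_values: np.ndarray, q: float) -> int:
    """Return the index of a random option among the top q fraction of
    options ranked by proxy value, instead of the single proxy argmax."""
    cutoff = np.quantile(proxy_values, 1.0 - q)
    top = np.flatnonzero(proxy_values >= cutoff)
    return int(rng.choice(top))

# Hypothetical usage: 1,000 candidate actions with noisy proxy scores.
proxy_scores = rng.normal(size=1_000)
choice = quantilize(proxy_scores, q=0.05)
print(f"Chose option {choice} with proxy score {proxy_scores[choice]:.2f}")
```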
Adversarial Goodhart
When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Abstract Model
Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
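A toy illustration of this dynamic (entirely my own construction, with made-up numbers): a small number of adversarially crafted candidates with inflated proxy scores is enough to capture an argmax-on-U selection process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers: 1,000 "natural" candidates where the proxy U tracks
# the true goal V, plus 10 candidates crafted by an adversary to score very
# high on U while actually being bad for V (and good for the adversary's W).
V_natural = rng.normal(size=1_000)
U_natural = V_natural + 0.3 * rng.normal(size=1_000)

U_adversarial = rng.normal(loc=5.0, scale=0.5, size=10)
V_adversarial = rng.normal(loc=-3.0, scale=0.5, size=10)

U = np.concatenate([U_natural, U_adversarial])
V = np.concatenate([V_natural, V_adversarial])

best = int(np.argmax(U))
print(f"Selected candidate {best}: U = {U[best]:.2f}, V = {V[best]:.2f}")
print("Adversarial candidate selected:", best >= 1_000)
```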
Examples
When you use a metric to choose between people, but then those people learn what metric you use and game that metric, this is an example of Adversarial Goodhart.
Adversarial Goodhart is the mechanism behind a superintelligent AI making a Treacherous Turn. Here, V is doing what the humans want forever, U is doing what the humans want in the training cases where the AI does not have enough power to take over, and W is whatever the AI wants to do with the universe.
Adversarial Goodhart is also behind the malignancy of the universal prior, where you want to predict well forever (V), so hypotheses might predict well for a while (U), so that they can manipulate the world with their future predictions (W).
Relationship with Other Goodhart Phenomena
Adversarial Goodhart is the primary mechanism behind the original Goodhart’s Law.
Extremal Goodhart can happen even without any adversaries in the environment. However, Adversarial Goodhart may take advantage of Extremal Goodhart, as an adversary can more easily manipulate a small number of worlds with extreme proxy values than it can manipulate all of the worlds.
Mitigation
Successfully avoiding Adversarial Goodhart problems is very difficult in theory, and we understand very little about how to do this. In the case of non-superintelligent adversaries, you may be able to avoid Adversarial Goodhart by keeping your proxies secret (for example, not telling your employees what metrics you are using to evaluate them). However, this is unlikely to scale to dealing with superintelligent adversaries.
One technique that might help in mitigating Adversarial Goodhart is to choose a proxy so simple, and to optimize it so hard, that adversaries have little or no control over the world which maximizes that proxy. (I want to emphasize that this is not a good plan for avoiding Adversarial Goodhart; it is just all I have.)
For example, say you have a complicated goal that includes wanting to go to Mars. If you use a complicated search process to find a plan that is likely to get you to Mars, adversaries in your search process may suggest a plan that involves building a superintelligence that gets you to Mars, but also kills you.
On the other hand, if you use the proxy of getting to Mars as fast as possible and optimize very hard, then (maybe) adversaries can’t add baggage to a proposed plan without being out-selected by a plan without that baggage. Building a superintelligence maybe takes more time than just having the plan tell you how to build a rocket quickly. (Note that the plan will likely include things like acceleration that humans can’t handle and nanobots that don’t turn off, so Extremal Goodhart will still kill you.)
I liked this post but wished it was short enough to store it all in my working memory. Partly because of the site formatting, partly because I think it was written as if it were an essay instead of a short reference post (which seems reasonable for the OP), I found it hard to scroll through without losing my train of thought.
I thought I’d try shortening it slightly and see if I could make it easier to parse. (Also collating various examples people came up with)
...
...
Goodhart Taxonomy
Goodhart’s Law states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them:
Regressional—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Model: When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6′3″, and a random 7′ person in their 20s would probably not be as good
Causal - When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
Example: an early 1900s college basketball team gets all of their players high-heeled shoes, because tallness causes people to be better at basketball. Instead, the players are slowed and get more foot injuries.
Extremal—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Example: the tallest person on record, Robert Wadlow, was 8′11″ (2.72m). He grew to that height because of a pituitary disorder, and he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet.”
Adversarial—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Model: Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
Example: aspiring NBA players might just lie about their height.
[note: I think most of the value of this came from the above list, but am curious if people find the rest of the post below easier to parse]
Regressional Goodhart
When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Abstract Model
When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.
The above description is when U is meant to be an estimate of V. A similar effect can be seen when U is only meant to be correlated with V by looking at percentiles. When a sample is chosen which is a typical member of the top p percent of all U values, it will have a lower V value than a typical member of the top p percent of all V values. As a special case, when you select the highest U value, you will often not select the highest V value.
Examples
Regressional Goodhart happens every time someone does something that is anything other than precisely the thing that maximizes their goal.
Regression to the Mean, Winner’s Curse, Optimizer’s Curse, Tails Come Apart phenomenon.
Relationship with Other Goodhart Phenomena
Regressional Goodhart is by far the most benign of the four Goodhart effects. It is also the hardest to avoid, as it shows up every time the proxy and the goal are not exactly the same.
Mitigation
When facing only Regressional Goodhart, choose the option with the largest proxy value. It’ll still be an overestimate, but will be better in expectation than options with a smaller proxy value. If possible, choose proxies more tightly correlated with your goal.
If you are not just trying to pick the best option, but also trying to have an accurate picture of what the true value will be, Regressional Goodhart may cause you to overestimate the value. If you know the exact relationship between the proxy and the goal, you can account for this by just calculating the expected goal value for a given proxy value. If you have access to a second proxy with an error independent from the error in the first proxy, you can use the first proxy to optimize, and the second proxy to get an accurate expectation of the true value. (This is what happens when you set aside some training data to use for testing.)
Causal Goodhart
When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Abstract Model
If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.
Examples
Humans often avoid naive Causal Goodhart errors, and most examples I can think of sound obnoxious (like eating caviar to become rich). A possible example is a human who avoids doctor visits because not being told about bad health is a proxy for being healthy. (I do not know enough about humans to know if Causal Goodhart is actually what is going on here.)
I also cannot think of a good AI example. Most AI is not acting in the kind of environment where Causal Goodhart would be a problem, and when it is acting in that kind of environment, Causal Goodhart errors are easily avoided.
Most of the time the phrase “Correlation does not imply causation” is used it is pointing out that a proposed policy might be subject to Causal Goodhart.
Relationship with Other Goodhart Phenomena
You can tell the difference between Causal Goodhart and the other three types because Causal Goodhart goes away when you just sample a world with a large proxy value, rather than intervene to make the proxy large.
Mitigation
One way to avoid Causal Goodhart is to only sample from or choose between worlds according to their proxy values, rather than intervening to cause the proxy. This clearly cannot be done in all situations, but it is useful to note that there is a class of problems for which Causal Goodhart cannot cause problems. For example, consider choosing between algorithms based on how well they do on some test inputs, where your goal is to choose an algorithm that performs well on random inputs. The fact that you choose an algorithm does not affect its performance, so you don’t have to worry about Causal Goodhart.
In cases where you actually change the proxy value, you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy.
Extremal Goodhart
Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Abstract Model
Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.
Examples
Humans evolved to like sugar because sugar was correlated with nutrition and survival in the ancestral environment (which had less sugar). Humans then optimized for sugar, got far too much of it, and became less healthy.
As an abstract mathematical example, let U and V be two correlated dimensions in a multivariate normal distribution, but we cut off the distribution to only include the ball of points in which U² + V² < n² for some large n. This example represents a correlation between U and V in naturally occurring points, but also a boundary around what types of points are feasible that need not respect this correlation. Imagine you were to sample k points and take the one with the largest U value. As you increase k, at first this optimization pressure lets you find better and better points for both U and V, but as you increase k to infinity, eventually you sample so many points that you will find a point near U = n, V = 0. When enough optimization pressure was applied, the correlation between U and V stopped mattering, and instead the boundary of what kinds of points were possible at all decided what kind of point was selected.
Many examples of machine learning algorithms performing badly because of overfitting are a special case of Extremal Goodhart.
Relationship with Other Goodhart Phenomena
Extremal Goodhart differs from Regressional Goodhart in that Extremal Goodhart goes away in simple examples like correlated dimensions in a multivariate normal distribution, but Regressional Goodhart does not.
Mitigation
Quantilization and Regularization are both useful for mitigating Extremal Goodhart effects. In general, Extremal Goodhart can be mitigated by choosing an option with a high proxy value, but not so high as to take you to a domain drastically different from the one in which the proxy was learned.
Adversarial Goodhart
When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
Abstract Model
Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
Examples
When you use a metric to choose between people, but then those people learn what metric you use and game that metric, this is an example of Adversarial Goodhart.
Adversarial Goodhart is the mechanism behind a superintelligent AI making a Treacherous Turn. Here, V is doing what the humans want forever. U is doing what the humans want in the training cases where the AI does not have enough power to take over, and W is whatever the AI wants to do with the universe.
Adversarial Goodhart is also behind the malignancy of the universal prior, where you want to predict well forever (V), so hypotheses might predict well for a while (U), so that they can manipulate the world with their future predictions (W).
Relationship with Other Goodhart Phenomena
Adversarial Goodhart is the primary mechanism behind the original Goodhart’s Law.
Extremal Goodhart can happen even without any adversaries in the environment. However, Adversarial Goodhart may take advantage of Extremal Goodhart, as an adversary can more easily manipulate a small number of worlds with extreme proxy values, than it can manipulate all of the worlds.
Mitigation
Successfully avoiding Adversarial Goodhart problems is very difficult in theory, and we understand very little about how to do this. In the case of non-superintelligent adversaries, you may be able to avoid Adversarial Goodhart by keeping your proxies secret (for example, not telling your employees what metrics you are using to evaluate them). However, this is unlikely to scale to dealing with superintelligent adversaries.
One technique that might help in mitigating Adversarial Goodhart is to choose a proxy so simple, and to optimize it so hard, that adversaries have little or no control over the world which maximizes that proxy. (I want to emphasize that this is not a good plan for avoiding Adversarial Goodhart; it is just all I have.)
For example, say you have a complicated goal that includes wanting to go to Mars. If you use a complicated search process to find a plan that is likely to get you to Mars, adversaries in your search process may suggest a plan that involves building a superintelligence that gets you to Mars, but also kills you.
On the other hand, if you use the proxy of getting to Mars as fast as possible and optimize very hard, then (maybe) adversaries can’t add baggage to a proposed plan without being out-selected by a plan without that baggage. Building a superintelligence maybe takes more time than just having the plan tell you how to build a rocket quickly. (Note that the plan will likely include things like acceleration that humans can’t handle and nanobots that don’t turn off, so Extremal Goodhart will still kill you.)
I am very happy you did this!
I added a Quick Reference section which contains your outline. I suspect your other changes are good too, but I don’t want to copy them in without checking to make sure you didn’t change something important. (Maybe it would be good if you had some way to communicate the difference or the most important changes quickly.)
I also changed the causal basketball example.
On a meta note, I wonder how we can build a system of collaboration more directly into Less Wrong. I think this would be very useful. (I may be biased as someone who has an unusually high gap between ability to generate good ideas and ability to write.)
I actually didn’t make many other changes (originally I was planning to rewrite large chunks of it to reflect my own understanding; instead, the primary thing ended up being “what happens when I simply convert a post with 18px font into a comment with 13px font”). I trimmed out a few words that seemed excessive, but this was more an exercise in “what if LW posts looked more like comments?” or something.
That said, if you think it’d be useful I’d be up for making another more serious attempt to trim it down and/or make it more readable—this is something I could imagine turning out to be a valuable thing for me to spend time on on a regular basis.
Note that this post has been turned into a paper, which expands on the ideas, and incorporates some more details.
(Scott—should you edit the post to link to the paper?)
I like this taxonomy of an important concept, and expect it to become a common reference work in other writings (for me, at least). Secondarily, I also appreciated the structure, and how much the technical language was only used to make things clearer (to me at least) and not to needlessly obfuscate at all. Promoted to Featured.
Edit: Or, I will promote it to Featured once my button for promoting it works. Will ping Oli/Ray about this presently.
Added: Here is a recent comment where I would’ve liked to link to this to help explain something, and have now gone back and re-inserted it.
Thanks!
For an AI-related Causal Goodhart example, what about Schmidhuber’s idea that an AI should maximise “complexity”? Since humans are the main cause of complexity (in the sense he was thinking of) in the current world, but would not be in an extreme world, this seems to fit.
Adversarial Goodhart is the only one that I’d say Goodhart may have intended, and I think the dynamics are more complex than you listed here, as I’ve argued extensively: https://www.ribbonfarm.com/2016/06/09/goodharts-law-and-why-measurement-is-hard/ - but you were much more concise, and I should similarly formalize my understanding. But this is really helpful, and I should be in touch with you about formalizing some of this further if I ever get my committee to sign off on this dissertation.
Note to add: We did formalize this more, and it has been available on Arxiv for quite a while.
I think the example of sugar is off. Sugar was not originally a proxy for vitamins, because sugar was rarer than vitamins. A taste for sugar was optimizing for calories, which at the time was heavily correlated with survival. If our ancestors had access to twinkies, they would have benefited from them. The problem isn’t that we became better at hacking the sugar signal, it’s that we evolved an open ended preference for sugar when the utility curve eventually becomes negative.
A potential replacement: we evolved to find bright, shiny colors in fruit attractive because that signified vitamins, and modern breeding techniques have completely hacked this.
I worry I’m being pedantic by bringing this up, but I think the difference between “hackable proxies” and “accurate proxies for which we mismodeled the underlying reality” is important.
Hmm, I think the fact that if our ancestors had access to twinkies they would have benefitted from them is why it is a correct example. The point is that we “learned” sugar is good from a training set in which sugar is low. Then, when we became better at optimizing for sugar, sugar became high and the proxy stopped working.
It seems to me that you are arguing that the sugar example is not Adversarial Goodhart, which I agree with. The thing where open ended preferences break because when you get too much the utility curve becomes negative, is one of the things I am trying to point at with Extremal Goodhart.
Okay, I think I disagree that extrapolating beyond the range of your data is Goodharting. I use the term for the narrower case where either the signal or the value stays in the trained range, but become very divergent from each other. E.g. artificial sweeteners break the link between sweetness and calories.
I don’t think this is quite isomorphic to the first paragraph, but highly related: I think of sweetness as a proxy for calories. Are you defining sweetness as a proxy for good for me?
I am thinking of sugar as a proxy for good for me.
I do not think that all instances of training data not matching the environment you are optimizing are Goodhart. However if the reason that the environment does not match the training is because the proxy is large, and the reason the proxy is large is because you are optimizing for it, then the optimization causes the failure of the proxy, which is why I am calling it Goodhart.
Inverse adversarial: adversaries try to affect your choice of proxy to already be aligned with their goals.
Very interesting! I like this formalization/categorization.
Hm… I’d have filed “Why the tails come apart” under “Extremal Goodhart”: this image from that post is almost exactly what I was picturing while reading your abstract example for Extremal Goodhart. Is Extremal “just” a special case of Regressional, where that ellipse is a circle? Or am I missing something?
Height is correlated with basketball ability.
Regressional: But the best basketball player in the world (according to the NBA MVP award) is just 6′3″ (1.91m), and a randomly selected 7 foot (2.13m) tall person in his 20s would probably be pretty good at basketball but not NBA caliber. That’s regression to the mean; the tails come apart.
Extremal: The tallest person on record, Robert Wadlow, was 8′11″ (2.72m). He grew to that height because of a pituitary disorder; he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet”, and he died at age 22. His basketball ability was well below what one might naively predict based on his height and the regression line, and that is unsurprising because the human body wasn’t designed for such an extreme height.
Great example!
It would be really nice if we had an example like this that worked well for all four types.
Adversarial: A college basketball player who wants to get drafted early and signed to a big contract grows his hair up, so that NBA teams will measure him as being taller (up to the top of his hair).
And:
-- http://www.nytimes.com/2003/06/15/sports/basketball/tall-tales-in-nba-dont-fool-players.html
I’m trying to think of a Causal Goodhart one. A bad one I came up with is that if someone thinks the reason taller people get better careers is because the hiring committee likes tall people, and so the person wears heels in their shoes, then this is Causal Goodhart because they’re trying to win on a proxy but in a way causally unrelated to the goal of having a good career.
But everyone knows the true causal story and doesn’t make this mistake, so it’s not a good example. Is there a causal story people don’t know about? Like perhaps some false belief about winning streaks (as opposed to the standard Kahneman story of regression to the mean).
Causal: An early 1900s college basketball team gets all of their players high-heeled shoes, because tallness causes people to be better at basketball. Instead, the players are slowed and get more foot injuries.
Adversarial: The New York Knicks’ coach, while studying the history of basketball, finds the story about the college team with high heels. He gets marketers to go to other league teams and convince them to wear high heels. A few weeks later, half of the star players in the league are out, and the Knicks easily win the championship.
I thought of almost this exact thing (with stilts). I like it and it is what I plan on using when I want a simple example. I wish it was more realistic though.
Extremal is not a special case of regressional, but you cannot separate them completely because regressional is always there. I think the tails come apart is in the right place. (But I didn’t reread the post when I made this.)
If you sample a bunch of points from a multivariate normal without the large circular boundary in my example, the points will roughly form an ellipse, and the tails come apart thing will still happen. This would be Regressional Goodhart. When you add the circular boundary, something weird happens where now your optimization is not just failing to find the best point, but actively working against you. If you optimize weakly for the proxy, you will get a large true value, but when you optimize very strongly you will end up with a low true value.
Are U and V swapped here? I was expecting the discussion of Extremal Goodhart to be about extreme proxy values, and U was defined to be the proxy at the beginning of the post.
They were. Fixed. Thanks!
I upvoted the post for the general theory. On the other hand, I think the examples could be more clear and it would be good to find examples that are more commonly faced.
I agree! The main product is the theory. I used examples to try to help communicate the theory, but they are not commonly faced at all.
It seems to me that the discussion on “Extremal Goodhart” is a bunch of good examples of what would be, following your ideas, “Nonlinear Goodhart”. It’s rather obvious that someone who is 270cm tall has an unusual body that could go either way for basketball, but it’s less clear what happens to someone who is 215 cm (when the NBA average is 198cm).
We basically observe correlations “locally”, in a neighborhood of the current values. Therefore we think of them as essentially linear, because every smooth function is “arbitrarily close to linear in an arbitrarily close vicinity” (the first-order Taylor expansion).
So let’s say, for example, that we see basketball players with the following heights and (abstract, think of Elo scores) “power ratings”
170cm − 0.8
180cm − 1.0
190cm − 1.2
200cm − 1.5
Note that this is already not linear, but gives salience to a lower threshold above which marginal increases in height give large increases in power. But because very tall people are scarce, we don’t know clearly whether already at 220cm power ratings are still increasing, let alone if they are increasing at smaller increments. Combined with Adversarial Goodhart, this is the stuff of asset bubbles.
I distinguish this from Regression Goodhart because the chief operating principle there is that we’re confused by noise—Goodhart was after all a central banker in the 60s, an era in which macroeconomists only had noisy quarterly datasets going back to the late 40s.
Also, as a related side-point, I’d add Cobra effect to the list.
Model: V is the true goal, but can’t be incentivized. R is an easily measured consequence of V which can be accomplished in other ways. If any of the other ways to achieve R are easier than accomplishing the true goal, agents will pursue that instead of the intended goal.
Example: A basketball player wants to impress the NBA recruiters, so he pays the other team to lose the game. They do so by not showing up, thereby forfeiting and losing, as agreed.
The link to “The Optimizer’s Curse” in the article is dead at the moment (<https://faculty.fuqua.duke.edu/~jes9/bio/The_Optimizers_Curse.pdf>), but I think I found it at <https://jimsmith.host.dartmouth.edu/wp-content/uploads/2022/04/The_Optimizers_Curse.pdf>. If that’s the right one, can you update the link?
> The fact that you choose an algorithm does not effect its performance, and you don’t have to worry about Causal Goodhart.
But now, I think you have to worry about a “Regressional Goodhart”
Maybe this would be pedantic to point out, but your choice of the best-performing model on test data is likely to have done that well by chance, as the number of models evaluated increases (hence validation and test splits).
(I will retry this later once I can figure out how to do image posts.)
(You can do images in posts but not comments. To do it in posts, bring up the highlight menu / double-click menu and click the image button, and then give it a link to the image (an imgur link or something).)
Or use markdown syntax for images.