Goodhart’s Law Causal Diagrams
Goodhart’s law is closely related to the inner and outer alignment problems and to the principal-agent problem. Understanding it better should help us solve these problems. This post is an attempt to work out a more precise, unified way of thinking about part of Goodhart’s law.
More specifically, we will interpret Goodhart’s law using causal diagrams. We will introduce this framing and look at one class of situations where a Goodhart problem might arise. In future posts, we will describe a more complete categorisation of these situations and how they relate to the AI alignment problem.
Definition of Goodhart’s law
The standard definition of Goodhart’s law is: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
What does this mean precisely and in what ways does this happen? Let’s zoom in on each part of this definition.
What does “Any observed statistical regularity” mean? In the language of causal diagrams, this statement can be translated as some set of causal relationships between observed and controlled variables. In the case of Goodhart’s law, one of these variables is a target (the variable that measures what you actually want), and one is a proxy (the variable you actually measure and optimize, as a stand-in for the target). There may be a complex set of connections and latent variables between the proxy, the target, and any variables you can change when taking actions. Let’s call any variable under your direct control an action variable.
Given this interpretation of “observed statistical regularity”, what does “will tend to collapse once pressure is placed upon it for control purposes” mean? At a high level, it means that when an attempt is made to maximize the proxy variable by changing the action variables, something about the relationship between the proxy and the target will (usually) change. Understanding Goodhart problems requires understanding both why this general situation comes up so regularly and the mechanics of how the problem occurs.
Causal Diagrams
In this causal diagram, we have the true target goal Value, which represents the variable that we want to maximize, and four potential proxy variables (Parent, Sibling, Child and Disconnected) that are each correlated with Value. Since this is a causal diagram, each node is a random variable, and the value of a parent node “causes” the value of the child node. For the purposes of this post, you don’t need to understand the technical definition of causality here; an intuitive understanding is enough.
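To make this concrete, here is a minimal sketch in Python of one DAG consistent with the description above (the exact structure is my assumption, since the original figure is not reproduced here): Parent causes Value, Value causes Child, Sibling shares the Parent with Value, and Disconnected has no causal link to Value at all. Sampling from a linear-Gaussian version of this model shows how each candidate proxy relates to the target.

```python
# A minimal sketch of one DAG consistent with the text (structure assumed):
# Parent -> Value, Value -> Child, Parent -> Sibling, Disconnected standalone.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

parent = rng.normal(size=n)                  # causal ancestor of Value
value = 0.8 * parent + rng.normal(size=n)    # the true target
child = 0.8 * value + rng.normal(size=n)     # causally downstream of Value
sibling = 0.8 * parent + rng.normal(size=n)  # shares the Parent with Value
disconnected = rng.normal(size=n)            # no causal path to Value at all

for name, proxy in [("Parent", parent), ("Child", child),
                    ("Sibling", sibling), ("Disconnected", disconnected)]:
    r = np.corrcoef(value, proxy)[0, 1]
    print(f"corr(Value, {name}) = {r:+.2f}")
```

With a large sample, Parent, Child and Sibling are all correlated with Value, while Disconnected is not; in the framing of this post, a Disconnected proxy only appears correlated because of small samples, p-hacking or shared trends.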
Let’s work through some examples to see how the causal structure framework can be applied.
Child proxy
You are checking the quality of water by taking a sample and testing for a small set of common contaminants. What you want to maximize is the quality of the water, so the test result is a child node of water quality. This goes wrong when the water is contaminated with something that isn’t being tested for. Worse, if the choice of which chemicals to test for is itself under our control and we optimize the test result, we will end up testing for chemicals that are unlikely to be present or irrelevant, so the test passes without the water quality improving.
Parent proxy
Having more money has a causal influence on how happy you will be, so we could say money is a parent of happiness. Say our goal is to maximize our own happiness, and we decide to use money as a proxy for happiness. Then we are using a parent proxy, which works for a while, until we have saturated the ability of money to increase happiness but keep seeking money at the expense of other parents of happiness, such as free time or good relationships.
Disconnected proxy
We have a new intervention for mitigating climate change: simply increase the number of pirates! The number of pirates is inversely correlated with world average temperature over the last 200 years or so. If we mistake this correlation for causation, then we might try using the number of pirates as a proxy for world temperature. This incorrect inference could be caused by p-hacking, or by not taking into account the low prior probability of a causal relationship between any two given variables.
Sibling proxy
You give a job to people who pass a test. Doing well on the test is causally downstream of having the relevant skills for the job. The target in this situation is “doing the job well”, which is also a child of “having the relevant skills”. So in this situation you are using a sibling node as the proxy measure. This can break down if the proxy has another parent that is downstream of your action variables. For instance, if knowledge about the test gets out, the correlation starts to break down, because passing the test is also causally downstream of knowing the test questions in advance.
Disconnected / Far away proxy
You want to improve the economy of a country by maximizing GDP, so you find a big dataset with lots of statistics about countries. You find every variable that you have control over, work out which ones are correlated with growth, and optimize those. Many of these variables are causally disconnected from GDP growth, or connected to it only distantly. For example, maximizing the number of amusement parks won’t have the benefit you hoped for, because its relationship with GDP growth is distant and not fully understood.
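To see the “collapse” from the sibling-proxy example in miniature, here is a toy simulation (the numbers and the explicit “cheating” variable are my own illustrative assumptions, not from the post). Test score and job performance initially share a single common cause, skill; once optimization pressure routes through a leaked test, the score rises while its correlation with performance collapses.

```python
# Toy sibling-proxy collapse: performance and test score share a cause (skill).
# Once the proxy acquires a second parent controlled by the optimizer
# (effort spent on leaked questions), the proxy-target correlation collapses.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

skill = rng.normal(size=n)
performance = skill + 0.5 * rng.normal(size=n)        # the target

for label, cheat_scale in [("no pressure on the proxy", 0.0),
                           ("test questions leaked   ", 3.0)]:
    cheat = cheat_scale * rng.exponential(size=n)     # action-driven parent of the proxy
    score = skill + cheat + 0.5 * rng.normal(size=n)  # the proxy
    r = np.corrcoef(score, performance)[0, 1]
    print(f"{label}: mean score = {score.mean():+.2f}, "
          f"corr(score, performance) = {r:.2f}")
```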
More detailed causal diagrams
The first causal diagram above is a simpler representation of a more accurate but more complex model, which looks more like this:
In this diagram, there are many irrelevant background variables and latent variables, and there is a different version of each variable at every timestep. Every variable has a “lightcone” of variables in the past that could influence its value, and of variables in the future whose value it could influence (the lightcone for the node Value at t=0 is shown). Note that in the causal diagram your actions tend to affect things in the lightcone of the proxy variable you are aiming for. It’s important to keep this more complex diagram in mind even when drawing a simplified version, otherwise it’s easy to become confused about things like variables that change over time or causal loops. Another ambiguity that this diagram helps with is that the Value you try to change always has to be in the future, and is usually a sum of value over time.
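As a small sketch of the time-unrolled picture (the particular variables and edges are illustrative assumptions, since the figure is not reproduced here), the past and future lightcones of Value at t=0 are just the ancestors and descendants of that node in the unrolled graph:

```python
# Time-unrolled causal graph: each variable persists over time and also
# influences its graph-children one timestep later. The "lightcone" of
# Value at t=0 is its set of ancestors (past) and descendants (future).
import networkx as nx

base_edges = [("Action", "Parent"), ("Parent", "Value"), ("Value", "Child")]
variables = {"Action", "Parent", "Value", "Child", "Background"}  # Background never feeds into Value

G = nx.DiGraph()
for t in range(-2, 3):                      # timesteps t = -2 .. 2
    for v in variables:
        G.add_edge((v, t), (v, t + 1))      # every variable persists over time
    for u, v in base_edges:
        G.add_edge((u, t), (v, t + 1))      # causal influence takes one step

past_lightcone = nx.ancestors(G, ("Value", 0))
future_lightcone = nx.descendants(G, ("Value", 0))
print("can influence Value at t=0:", sorted(past_lightcone))
print("Value at t=0 can influence:", sorted(future_lightcone))
```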
Why do we not target value directly?
Mistargeting is our name for the category of Goodhart problems that are caused by choosing a bad proxy in the first place. When it comes to AI alignment, we can break this down into several different reasons it could happen.
Mistakes (missing the target)
A mistake about what you value. E.g. you thought you wanted money, so you optimized for that, but you actually want both money and lots of free time.
Deliberate approximations (hitting the target directly is too expensive or difficult)
Designer’s proxy: for reasons of simplicity, efficiency or limited attention, an AI designer specifies the AI’s goal to be a proxy for the designer’s true goals. E.g. the designer chooses a standard classification loss, even though they actually care about some types of errors more than others.
AI’s proxy: the AI chooses to pursue a subgoal, e.g. making money to be used later. Inner misalignment is what happens when the pursuit of such a subgoal has enough optimization power behind it and is not held in check by the surrounding AI.
Mitigation
A few strategies seem like reasonable approaches to reducing the impact of mistargeting value.
Actually just measure the target
Even if the target is expensive to measure, it might be worth measuring it properly every now and then, and using this to validate your proxies.
Be careful about the identity of variables
Break up variables that might be mashed together in your world model, and experiment with adjusting your concept boundaries. This helps prevent mistakes about what you actually value.
Track many proxies
Use multiple proxies to validate each other.
Attempt to find a set of proxies that separates your action variables from your target variables, so that the proxies capture all the ways your actions affect the node you value (see the sketch after this list). If you know the relationship between the proxies and the target precisely, then (up to some additional noise) this is just as good as measuring the target variables directly, and Goodhart’s law no longer applies.
At random times, select from a wide set of proxy variables.
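Here is a rough sketch of what checking for such a separating set could look like (the graph, node names and helper function are my own illustration, not from the post). It uses the simple criterion that every directed path from an action variable to the target must pass through some proxy, i.e. removing the proxies should disconnect the actions from the target:

```python
# Check whether a proxy set blocks every directed path from the actions
# to the target: remove the proxies and test reachability.
import networkx as nx

G = nx.DiGraph([
    ("Action1", "ProxyA"), ("ProxyA", "Value"),
    ("Action2", "ProxyB"), ("ProxyB", "Value"),
    ("Action2", "Latent"), ("Latent", "Value"),   # a path that bypasses ProxyB
])

def proxies_separate(graph, actions, proxies, target):
    """True if removing the proxies disconnects every action from the target."""
    pruned = graph.copy()
    pruned.remove_nodes_from(proxies)
    return not any(pruned.has_node(a) and nx.has_path(pruned, a, target)
                   for a in actions)

print(proxies_separate(G, ["Action1", "Action2"], ["ProxyA", "ProxyB"], "Value"))
# False: Action2 -> Latent -> Value is not captured by any proxy.
print(proxies_separate(G, ["Action1", "Action2"],
                       ["ProxyA", "ProxyB", "Latent"], "Value"))
# True: adding Latent to the proxy set blocks every causal route.
```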
Reduce impact of actions
Try to avoid actions that change the circumstances in unknown or complex ways, to reduce the chances of interfering with the relationships you care about.
Improve your model
Always try to improve your model of the relationship between your proxy, your goal and your actions. This will usually have to be done with very sparse data about the target, because if the target were cheap to measure you wouldn’t be optimizing a proxy in the first place. Learning from very little data requires good prior information.
Choose proxies that are expensive to interfere with
This is most relevant in adversarial situations, which we mostly haven’t talked about yet. In non-adversarial situations, it corresponds to choosing proxies that are closer to the target and more robustly linked to it under natural changes in background variables.
A sufficiently good understanding of mitigation strategies will be useful for designing inner-aligned agents (by building these mitigation strategies into an agent) and for outer alignment (by helping agent designers better understand how to select goals and identify failure cases). What other approaches can you think of for reducing the impact of Goodhart’s law?
Conclusion
A concrete and easy-to-apply understanding of Goodhart’s law is valuable, because understanding it deeply and mitigating it is one of the most central problems for AI alignment research. We hope this causal perspective will be useful for thinking about the problem, because it unifies and clarifies a lot of existing thought about Goodhart problems. In two planned posts, we will extend it further by discussing scenarios where actions influence the future causal structure, discussing scenarios where an imperfect understanding of the causal structure leads to poor generalization, and tightening the connection of these ideas to AI alignment.
Core ideas for this post are from Justin Shovelain with Jeremy Gillen as the primary writer.
The initial section seems very closely related to the work Scott did and we wrote up in the “Three Causal Goodhart Effects” section of this paper: https://arxiv.org/pdf/1803.04585.pdf, including some of the causal graphs.
Regarding mitigations, see my preprint here: https://mpra.ub.uni-muenchen.de/98288/, which, in addition to some of the mitigations you discussed, also suggests secret metrics, randomization, and post-hoc specification as strategies. Clearly, these don’t always apply in AI systems, but they can potentially be useful in at least some cases.
I think causal diagrams naturally emerge when thinking about Goodhart’s law and its implications.
I came up with the concept of Goodhart’s law causal graphs above after a presentation someone gave at the EA Hotel in late 2019 on Scott’s Goodhart Taxonomy. I thought causal diagrams were a clearer way to describe some parts of the taxonomy, but their relationship to the taxonomy is complex. I also only encountered the paper you and Scott wrote a couple of weeks ago, when getting ready to write this Good Heart Week prompted post, and I was planning to reference it in the next post, where we address “causal stomping” and “function generalization error” and can more comprehensively describe the relationship with the paper.
In terms of the relationship to the paper, I think the Goodhart’s law causal graphs I describe above are more fundamental, and atomically describe the relationship types between the target and proxies in a unified way. I read the causal diagrams in your paper as describing various ways causal-graph relationships may be broken by taking action, rather than simply describing relationships between proxies and targets and the ways they may be confused with each other (which is the function of the Goodhart’s law causal graphs above).
Mostly the purpose of this post and the next is to present an alternative, and I think cleaner, ontological structure for thinking about Goodhart’s law, though there will still be some messiness in carving up reality.
As to your suggested mitigations, both randomization and secret metrics are good to add, though I’m not as sure about post-hoc specification. Thanks for the suggestions and the surrounding paper.
Douglas Hubbard’s book How to Measure Anything provides good examples of what it looks like for a target to be expensive to measure: frequently the measurement feels sloppy or unrigorous (because precision is expensive). So it’s a common mistake to avoid measuring what we care about directly but sloppily, in order to work with nice, clean, quantitatively objective, hard-to-criticize but not very informative data instead.
Did this ever happen?
No. Justin knows roughly the content of the intended future posts, but after getting started on the writing I didn’t feel I understood it well enough to distill it properly and I lost motivation, and since then I have become too busy.
I’ll send you the notes that we had after Justin explained his ideas to me.