Goodhart Typology via Structure, Function, and Randomness Distributions

(Work done at Convergence Analysis. The ideas are due to Justin. Mateusz wrote the post. Thanks to Olga Babeeva for feedback on this post.)

In this post, we introduce a typology of structure, function, and randomness, building on the framework introduced in the post Goodhart’s Law Causal Diagrams. We aim to present a comprehensive categorization of the causes of Goodhart’s problems.

But first, why do we care about this?

Goodhart’s Law recap

The standard definition of Goodhart’s Law is: “when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.”

More specifically: we see a meaningful statistical relationship between the values of two random variables $I$ and $T$,[1] where we have some preferences over the values of $T$. We decide to change the value of $I$, expecting that the value of $T$ will change as would be expected by naively extrapolating the statistical tendencies of the two variables. We fall victim to Goodhart’s Law (i.e. a Goodhart failure occurs) if the change in $T$ doesn’t quite meet our expectations and, upon learning that, we see that we should have intervened in some different way.

Some motivation

Many alignment failures can be cast in terms of Goodhart’s Law. Specifically, the AI is given an objective that doesn’t fully specify what we want it to optimize for (outer alignment), or it finds a way to exploit the misspecification of the training goal, developing an internal objective that is misaligned with the one we specified (inner alignment /​ mesaoptimization). Therefore, a better understanding of Goodhart’s Law — its various subtypes, causes, and possible remedies — might better inform us about how to effectively anticipate and prevent (or at least mitigate) those failure modes. In addition to technical AI alignment/​safety, a better understanding of Goodhart’s Law might shed some light on enhancing human coordination, given that the same outer and inner alignment problems (or close analogs thereof) recur in coordination structures. In particular, effective AI governance will most likely require specifications of regulations that are robust to goodharting.

Introduction

In the previous post, we presented how causal diagrams can be used to describe situations vulnerable to Goodhart’s problems. In this post, we build on that framework to present a complete categorization of possible causes of Goodhart’s problems.[2]

The key principle to hold in the back of one’s mind is the following.

All Goodhart problems stem from misunderstanding the causal relationships among the intervention, target, and measurement variables.

More specifically, if you know all relevant facts about the relationship between

1. your interventions ($I$),

2. the target variable you want to optimize ($T$), and

3. your measurements of the target variable ($M$),

then, in principle, you know enough to infer what interventions are optimal for maximizing the expected value of the target variable (according to your preference ordering, whatever it is).

The only reasons expected utility maximization can encounter Goodhart problems are ignorance (e.g., not knowing the right method for inferring the right course of action in a given circumstance) and computational limitations.

Here, we present a framework flexible enough to cover all the Goodhart-related failure modes in terms of mistaken beliefs about the causal diagram. We discuss various examples throughout the post.

With this motivation, we cast all categories of Goodhart problems as mistakes of inference under uncertainty, where the ultimate object of inference is the choice of action and the intermediate object of inference is the causal diagram.

We propose a classification of causes of Goodhart failures into three categories, representing which aspect of the causal diagram the agent failed on.

  1. Structure. The agent is mistaken about the causal structure mediating between the intervention variable $I$, the target variable $T$, and the measure variable $M$.

  2. Function. The agent is mistaken about the functional relationship defining the interactions between the intervention variable, the target variable, and the measure variable.

  3. Randomness. The agent is mistaken about the distributions of the auxiliary random variables in the graph.

If we think of an idealized process of learning the causal diagram, we can think about learning these three aspects one after another and identifying which link in the chain failed.

Another dimension worth keeping in mind describes the reason the agent failed on this aspect of the causal diagram. It is not the main focus of this post, but it might be relevant for finding strategies to mitigate Goodhart’s problems.

  1. Learning failure. The agent has learned a causal structure that was wrong already at the time of learning.

  2. Transfer failure. The agent has learned a causal structure that was right at the time of learning but became “deprecated” or “outdated” by the time of intervention, i.e. it failed to transfer (e.g. the environment changed or some processes optimized against the agent).

  3. Non-interference failure. The agent has learned a causal structure that would accurately describe the relationships between variables at the time of intervention but the agent failed to model the impact of its actions on the causal structure. The agent’s intervention appears to have broken the causal structure because the actions’ effects also flowed through “invisible side channels” that the learned causal diagram didn’t account for.[3]

| Failed aspect \ Cause of failure | Learning failure | Transfer failure | Non-interference failure |
|---|---|---|---|
| Structure | Learned wrong causal structure | Learned correct causal structure, but it failed to transfer to intervention time | Learned correct causal structure that transferred to intervention time, but the intervention broke the alignment between the causal structure and its representation |
| Function | Learned wrong functional dependencies | Learned correct functional dependencies, but they failed to transfer to intervention time | Learned correct functional dependencies that transferred to intervention time, but the intervention broke the alignment between the functional dependencies and their representation |
| Randomness | Learned a wrong distribution of random components | Learned correct distribution of random components, but it failed to transfer to intervention time | Learned correct distribution of random components that transferred to intervention time, but the intervention broke the alignment between the distribution of random components and its representation |

Additionally, we might want to distinguish between failures due to insufficient knowledge and those due to constraints on resources necessary for computing the impact of an action (time, space, algorithms). However, given that we assume knowledge results from inference anyway, these two categories blur into one. Therefore, this distinction is irrelevant in this post.

For simplicity, this post focuses on simplified situations where learning time and intervention time are distinct and learning proceeds only by observation, not experimentation. Situations involving continuous learning are naturally much more complex but we expect the concepts and lenses from this post to generalize to them.

Ontology

Causal diagrams (re-)introduced

Goodhart’s Law Causal Diagrams introduced causal diagrams as a framework for thinking about Goodhart’s Law but did not treat causal diagrams rigorously. Here, we introduce causal diagrams formally.

The purpose of causal diagrams is to comprehensively model the dependencies between random variables. Their primary use case is to determine how specific interventions on the causal structure will influence our variable of interest.

A causal diagram (CD) is a special kind of directed acyclic graph (DAG). Its nodes represent random variables and its edges represent causal relationships between the random variables. The existence of an edge $X \to Y$ means “the value of $X$ (co-)determines the value of $Y$”. The value of the random variable $Y$ is fully determined by a function $f_Y(X_1, \dots, X_n, U_Y)$, where $X_1$ through $X_n$ are the parents of $Y$ and $U_Y$ is its random auxiliary variable, representing a random component of $Y$ (“U” for “unknown”). In other words, the value is determined as $y = f_Y(x_1, \dots, x_n, u_Y)$ with $u_Y \sim U_Y$ (slight abuse of notation: we use $U_Y$ to denote both the random variable and its distribution).

In a CD, auxiliary variables stand for irreducible sources of randomness and are the only sources of probabilistic uncertainty. An auxiliary variable can have any probability distribution. Any probabilistic dependence between random variables can be neatly modeled using this kind of mix of deterministic functions and random distributions. Every random variable $X$ in a CD has exactly one auxiliary variable $U_X$. Therefore, the orphans (nodes without parents) in a CD are exactly the auxiliary variables, and if a proper random variable has no parents except its auxiliary variable, its value is fully determined by the value of that auxiliary variable.[4]
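To make the definition concrete, here is a minimal sketch in Python of a causal diagram as nodes with parents, a deterministic function, and an auxiliary noise source, sampled in topological order. The `Node` class, the example chain, and the specific functions and noise distributions are illustrative assumptions, not part of the formalism.

```python
import random

class Node:
    """A random variable in a causal diagram: value = f(parent values..., u), with u drawn from `noise`."""
    def __init__(self, name, parents, f, noise):
        self.name = name        # label of the random variable
        self.parents = parents  # list of parent Nodes
        self.f = f              # deterministic function of (parent values..., u)
        self.noise = noise      # sampler for the auxiliary variable U_X

def sample(nodes):
    """Ancestral sampling: evaluate each node after its parents (nodes must be in topological order)."""
    values = {}
    for node in nodes:
        u = node.noise()
        parent_values = [values[p.name] for p in node.parents]
        values[node.name] = node.f(*parent_values, u)
    return values

# Illustrative I -> T -> M chain with additive Gaussian noise everywhere.
I = Node("I", [], lambda u: u, lambda: random.gauss(0, 1))
T = Node("T", [I], lambda i, u: 2 * i + u, lambda: random.gauss(0, 1))
M = Node("M", [T], lambda t, u: t + u, lambda: random.gauss(0, 0.1))

print(sample([I, T, M]))  # e.g. {'I': 0.41, 'T': 1.02, 'M': 0.95}
```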

We have just introduced causal diagrams in the following order:

  1. Structure — Nodes represent random variables, and edges represent causal relationships between them.

  2. Function — The value of a node $X$ is determined by a function $f_X$ of all the nodes that send an edge to it…

  3. Randomness — … and a random component $U_X$, AKA the auxiliary variable of $X$.

The three “steps” provide a handy model for thinking about the process of an agent learning a causal diagram, as each step adds resolution to what was established in the previous one.[5] Another reason to think about them in this exact sequence is that to get the function aspect right, we first need to get the structure aspect right (basically by definition); and, similarly, to get the randomness aspect right, we need to get the function aspect right first (this is not strictly true but approximately true). This motivates the categorization of errors in learning and using causal diagrams into errors of structure, function, and randomness, which we describe in detail in Section 2.
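As a toy illustration of this idealized sequence, the sketch below posits a structure, fits the functional dependence from observational data, and then estimates the distribution of the residual random component. The data-generating process and the linear model class are assumptions made up for this sketch.

```python
import random

# Step 0: generate observational data from an assumed true process T = 3*I + U_T.
data = []
for _ in range(5_000):
    i = random.uniform(0, 10)
    t = 3 * i + random.gauss(0, 2)
    data.append((i, t))

# 1. Structure: we posit the single edge I -> T (taken as given here).

# 2. Function: fit f_T by least squares (through the origin, for simplicity).
slope = sum(i * t for i, t in data) / sum(i * i for i, t in data)

# 3. Randomness: estimate the distribution of the residual auxiliary variable U_T.
residuals = [t - slope * i for i, t in data]
mean_u = sum(residuals) / len(residuals)
var_u = sum((r - mean_u) ** 2 for r in residuals) / len(residuals)

print(round(slope, 2), round(mean_u, 2), round(var_u, 2))  # roughly 3, 0, 4
```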

First, however, we need to describe how intervening on a causal diagram works, and for that purpose, we need to describe how we’re going to use causal diagrams to think about the relationship between the intervention, target, and measure variables.

Importantly, for the causal diagram to be useful in practice, it needs to be repeatable, i.e. it should retain its applicability to a given kind of situation over several iterations, rather than being useful only for a single “round” after which it can no longer be used. For example, if we want to use a CD to predict the effect of a policy on the economy, we would like it to still be useful after we execute the intervention that we deem the most desirable based on the CD.

Intervention, target, and measure

Let’s start with a diagram:

So, we have three variables of interest. $T$ is the target variable that we want to optimize. $I$ is the intervention variable that we are influencing to optimize the target $T$. In general, we may not have direct access to the value of the target and can only read it through $M$, the measure variable. Since we want $I$ to influence $T$ and we want $T$ to influence $M$, $I$ needs to be (causally) upstream of $T$, whereas $T$ needs to be (causally) upstream of $M$: $I \to T \to M$.

Therefore, we need to account for two causal relationships, that of $I$ on $T$ and that of $T$ on $M$.[6]

Intervening on a causal diagram works by “taking control” over the values of one or more random variables. Here, we focus on the simplest case, where we choose one or more variables and set the value of each to some fixed value, ignoring what value (or rather, what distribution of values) they would have otherwise, given some values of their parent variables.[7]

The structure of the diagram is fully deterministic, with the only free parameters being the orphan auxiliary random variables, over which we have some probability distributions, and, obviously, the intervention variable $I$ (if we choose to intervene). Therefore, to determine the probability distribution over $T$ conditional on some $I = i$, we need to extend the function $f_T$ to $f^*_T(i, U_1, \dots, U_k)$, where $U_1$ through $U_k$ are the random (auxiliary) ancestors of $T$.

Then, we can define the probability distribution:

$$P(T = t \mid I = i) = P\big(f^*_T(i, U_1, \dots, U_k) = t\big)$$

Then, the expected value of $T$ conditional on $I = i$ is:

$$\mathbb{E}[T \mid I = i] = \mathbb{E}_{U_1, \dots, U_k}\big[f^*_T(i, U_1, \dots, U_k)\big]$$

This is also how we define the induced utility function on $I$ (with $V$ denoting our utility function over the values of $T$):

$$V_I(i) = \mathbb{E}\big[V(T) \mid I = i\big] = \mathbb{E}_{U_1, \dots, U_k}\big[V\big(f^*_T(i, U_1, \dots, U_k)\big)\big]$$
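As a hedged sketch of how this induced utility could be estimated in practice: clamp $I$ to a candidate value, sample the remaining auxiliary variables, and average the utility of the resulting target values. The structural function `f_T`, the utility `V`, and the candidate grid below are made-up stand-ins, not taken from the post.

```python
import random

def f_T(i, u_T):
    # Illustrative structural function for T under the intervention value i.
    return 2 * i + u_T

def V(t):
    # Illustrative utility over the target: prefer T close to 10.
    return -(t - 10) ** 2

def induced_utility(i, n_samples=10_000):
    """Monte Carlo estimate of V_I(i) = E[V(f_T(i, U_T))] when I is clamped to i."""
    total = 0.0
    for _ in range(n_samples):
        u_T = random.gauss(0, 1)  # sample the auxiliary variable U_T
        total += V(f_T(i, u_T))
    return total / n_samples

# Pick the candidate intervention with the highest estimated induced utility.
candidates = [k / 2 for k in range(21)]  # i in {0.0, 0.5, ..., 10.0}
best = max(candidates, key=induced_utility)
print(best)  # should be close to 5.0, since f_T(5, 0) = 10
```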

An exactly analogous situation arises between $T$ and $M$. We extend $f_M$ to $f^*_M(t, U'_1, \dots, U'_m)$, where $U'_1$ through $U'_m$ are the random ancestors of $M$.

For the most part, we will not discuss the measurement variable $M$ explicitly. However, it is relevant to keep it in mind because the information about the dependence between $T$ and $M$ influences the learning of the causal diagram. It is also obviously relevant for continuous learning.

Goodhart failure

Given a utility function $V$ over the target node $T$, a Goodhart failure occurs when the change in $T$ doesn’t quite meet our expectations and, upon learning that, we see that we should have intervened in some different way.

To give a simple example: Suppose we want to maximize the number of fruits in a basket containing some number of apples, bananas, oranges, and some number of other fruits, and we can choose the number of apples to be anything from 0 to 100. Our probabilistic beliefs about the numbers of bananas and oranges are represented by random variables $B$ and $O$, and there is some additional uncertainty, represented by a random variable $R$, about the number of fruits in the basket standing for “other”. Since our utility function is linear in the number of each type of fruit, we choose 100 apples. If the utility function were non-monotonic — e.g. one that peaks at some target total, with $T$ being the total number of fruits — we might need to choose something other than one hundred, depending on the details of our beliefs about bananas and oranges.
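Here is a small numeric sketch of this example; the distributions assumed for bananas, oranges, and “other” are made up for illustration. With a utility linear in the total fruit count, 100 apples is always optimal, while with a non-monotonic utility the best apple count depends on our beliefs about the other fruits.

```python
import random

def sample_other_fruits():
    # Made-up beliefs about the non-apple fruits (bananas + oranges + "other").
    return random.randint(10, 30) + random.randint(5, 25) + random.randint(0, 20)

def expected_utility(apples, utility, n=5_000):
    return sum(utility(apples + sample_other_fruits()) for _ in range(n)) / n

linear = lambda total: total               # monotone: more fruit is always better
peaked = lambda total: -(total - 80) ** 2  # non-monotone: prefer a total of about 80 fruits

print(max(range(101), key=lambda a: expected_utility(a, linear)))  # 100 (up to Monte Carlo noise)
print(max(range(101), key=lambda a: expected_utility(a, peaked)))  # ~35: 80 minus the ~45 expected other fruits
```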

Types of Goodhart failures

Structural errors

A structural error occurs when the agent is mistaken about the structure of the causal diagram, i.e. nodes (random variables) and edges (causal relationships).

Being mistaken about the causal structure mediating the influence of $I$ on $T$ can be a source of error in deciding on the intervention (i.e. on the value of $I$). However, given that the agent’s assessment of which nodes are useful as intervention variables is dictated by (among other things[8]) the causal and functional structure between the node (a potential $I$) and $T$, being mistaken about the causal structure can also lead us to choose a wrong $I$.

These problems can take several forms, which can also occur together: failing to identify all (and only) the relevant variables, missing causal links that do exist, or including causal links that do not.

This kind of error will occur if we learn the wrong causal structure between the measure and the target, since in such a case we may not have enough knowledge to robustly optimize the target. A similar situation occurs if we learn a causal diagram that was a good representation at learning time but fails to generalize to the time of intervention because the environment has changed (for whatever reason). However, for our purposes, the distinction between mis-learning the CD and learning a correct CD that doesn’t transfer into the future is not important, because what we care about is learning a CD that is appropriate for enabling future optimization of $T$.

The more interesting (and trickier) situation arises when the agent learns a causal structure that would be appropriate for the future situation, if not for the agent’s own intervention that “broke” it (i.e. caused it to fail to generalize). In a strict sense, this is impossible, because there is no way to change a causal diagram by intervening on its nodes. In a less strict sense, however, if the causal diagram is inappropriate — e.g. the functional dependencies represented by some edges actually depend on the values of some nodes that haven’t been included in the diagram — the intervention may make it appear as if the causal structure is being violated.
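A minimal sketch of a structural error, with a confounded data-generating process made up for illustration: the agent sees a strong correlation between a candidate intervention variable and the target, learns the edge $I \to T$, intervenes, and nothing happens, because the correlation was driven by a hidden common cause.

```python
import random

def observational_sample():
    # True (hidden) structure: C -> I and C -> T, with no edge I -> T.
    c = random.gauss(0, 1)
    i = c + random.gauss(0, 0.1)
    t = 2 * c + random.gauss(0, 0.1)
    return i, t

def interventional_sample(i_fixed):
    # Clamping I cuts it off from its parent C; T still follows C alone.
    c = random.gauss(0, 1)
    t = 2 * c + random.gauss(0, 0.1)
    return i_fixed, t

obs = [observational_sample() for _ in range(10_000)]
slope = sum(i * t for i, t in obs) / sum(i * i for i, t in obs)
print(round(slope, 2))  # ~2: naively, raising I by 1 "should" raise T by ~2

do = [interventional_sample(5.0) for _ in range(10_000)]
print(round(sum(t for _, t in do) / len(do), 2))  # ~0: intervening on I does nothing to T
```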

Examples

  1. Smoking and tobacco: “For a particular individual, it is impossible to definitively prove a direct causal link between exposure to a radiomimetic poison such as tobacco smoke and the cancer that follows; such statements can only be made at the aggregate population level. Cigarette companies have capitalized on this philosophical objection and exploited the doubts of clinicians, who consider only individual cases, on the causal link in the stochastic expression of the toxicity as an actual disease.”

  2. Amyloid beta plaques associated with Alzheimer’s disease were long suspected to be causally involved in the advancement of the disease. This motivated medical research to aim at developing drugs targeting those plaques. However, amyloid beta plaques may be just a symptom of the disease, and eradication of the symptom (even if successful) is not necessarily going to help with the disease (depending on the specific mechanism of symptom eradication).

  3. Blue zones: “A blue zone is a region in the world where people are claimed to have exceptionally long lives beyond the age of 80 due to a lifestyle combining physical activity, low stress, rich social interactions, a local whole foods diet, and low disease incidence.” However, more recent evidence suggests that the exceptional numbers of reported centenarians and supercentenarians were caused by poor record keeping in these areas, which allowed for errors and fraud.[9]

  4. Survivorship bias: “During World War II, the statistician Abraham Wald took survivorship bias into his calculations when considering how to minimize bomber losses to enemy fire. The Statistical Research Group (SRG) at Columbia University, [of] which Wald was a member, examined the damage done to aircraft that had returned from missions and recommended adding armor to the areas that showed the least damage. The bullet holes in the returning aircraft represented areas where a bomber could take damage and still fly well enough to return safely to base. Therefore, Wald proposed that the Navy reinforce areas where the returning aircraft were unscathed, inferring that planes hit in those areas were the ones most likely to be lost.”

Functional errors

A functional error happens when the learned functional relationship between $I$ and $T$ is not appropriate for the intervention time.

Recall that we started with a utility function $V$ over the value set of the target node $T$ and a function $f_T$ specifying how the value of $T$ depends on the variables upstream from it. Then we used that to find an appropriate intervention node $I$ in the diagram and to induce a utility function $V_I$ over its possible values, such that choosing the values of $I$ to maximize $V_I$ will translate into the target value also maximizing its preference ordering (on average, in expectation, &c., given that we’re operating in a probabilistic setting).

We might take this as our desideratum: given a utility function $V$ and a functional relationship $f_T$, derive a preorder on the intervention variable $I$ such that maximization of the latter transfers, through $f_T$, to maximization of the former. (In the language of order theory, we can say that the induced utility $V_I$ is monotone in the argument $I$.[10])

A functional error occurs when this desideratum fails, i.e., optimizing $V_I$ fails to translate into optimizing $V$ (as robustly as the agent wants). This can be either because the agent was mistaken about the function $f_T$ itself (e.g. neglected some non-linear interactions between the intervention value and other arguments of the function) or because it mis-inferred the preference ordering over $I$. (This lines up with the distinction between incomplete knowledge and insufficient computation.)

Importantly, the preference ordering over $I$ doesn’t have to be pre-computed at learning time. Even if we are “just” inferring what interventions to make at intervention time, we are going to be comparing alternative values of $I$, which amounts to computing that preference ordering locally.
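A minimal sketch of a functional error, with made-up numbers: the agent observes $I$ and $T$ over a narrow range, fits a linear relationship, extrapolates, and picks an intervention that the true, non-monotonic relationship punishes.

```python
# Illustrative functional error: the true relationship between I and T is
# non-monotonic, but the agent only observed a narrow range and fit a line.

def true_f_T(i):
    # True structural function: the benefit saturates and then reverses.
    return 10 * i - i ** 2

# Observations gathered while i stayed in [0, 3]: locally, T looks linear in I.
observed = [(i, true_f_T(i)) for i in [0, 1, 2, 3]]
slope = (observed[-1][1] - observed[0][1]) / (observed[-1][0] - observed[0][0])  # = 7

# The agent's learned (wrong) functional relationship, extrapolated to large i.
learned_f_T = lambda i: slope * i

best_by_learned_model = max(range(21), key=learned_f_T)        # picks i = 20
print(true_f_T(best_by_learned_model))                         # -200: far worse than the true optimum
print(max(true_f_T(i) for i in range(21)))                     # 25, attained at i = 5
```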

Examples

  1. Kleiber’s Law: “Kleiber’s law, named after Max Kleiber for his biology work in the early 1930s, states, after many observations that, for a vast number of animals, an animal’s Basal Metabolic Rate scales to the 3⁄4 power of the animal’s mass. [...] Before Kleiber’s observation of the 3⁄4 power scaling, a 2⁄3 power scaling was largely anticipated based on the “surface law”, which states that the basal metabolism of animals differing in size is nearly proportional to their respective body surfaces. This surface law reasoning originated from simple geometrical considerations. As organisms increase in size, their volume (and thus mass) increases at a much faster rate than their surface area. Explanations for 2⁄3-scaling tend to assume that metabolic rates scale to avoid heat exhaustion. Because bodies lose heat passively via their surface but produce heat metabolically throughout their mass, the metabolic rate must scale in such a way as to counteract the square–cube law. Because many physiological processes, like heat loss and nutrient uptake, were believed to be dependent on the surface area of an organism, it was hypothesized that metabolic rate would scale with the 2⁄3 power of body mass.”

  2. Malthusianism: “Malthusianism is a theory that population growth is potentially exponential, according to the Malthusian growth model, while the growth of the food supply or other resources is linear, which eventually reduces living standards to the point of triggering a population decline. This event, called a Malthusian catastrophe (also known as a Malthusian trap, population trap, Malthusian check, Malthusian crisis, Point of Crisis, or Malthusian crunch) has been predicted to occur if population growth outpaces agricultural production, thereby causing famine or war.”

    1. Malthus’s predictions were not borne out and they had not been justified at the time of his writing either.

  3. Exposure of small children to allergens has been advised against. This turned out to have consequences exactly opposite to the ones that were desired: without allergen exposure, the immune system doesn’t have an opportunity to learn to distinguish safe from non-safe food.

  4. Before Kepler, planets were believed to follow circular orbits and move at constant speed. Kepler revised the heliocentric model and postulated elliptic, not circular, orbits, and varying speed of motion. This was an improvement on the model of the relationship between time and position of a planet.

  5. The relationship between alcohol and health outcomes was believed to follow a U-shaped curve, with a small amount of drinking being optimal. After accounting for the fact that many teetotalers are people with a history of alcohol abuse, the initial segment of the curve flattens, and there is no strong evidence for meaningful effects of small amounts of alcohol consumption on health outcomes either way.

Calibration errors

We might be mistaken about the nature of the random auxiliary variables in the causal diagram. This is miscalibration, hence we call these calibration errors.

The target variable depends on all the auxiliary variables that are upstream from it, except for those that route only through the intervention variable if it is being intervened on.

Miscalibration about the distribution of any of the auxiliary variables might (depending on other details of the causal diagram) have large consequences for how the expected results of an intervention diverge from its actual results.

If the miscalibration pertains to a sort of additive noise on the target variable — i.e. the function being of the form $T = f(I) + U_T$ (or something similar, with noise propagating from its ancestors) — then the result is relatively mild, because it doesn’t impact which interventions maximize the utility function $V$ (at least for a monotone $V$). However, if, say, $U_T \sim \mathcal{N}(\mu, \sigma^2)$ and $V(T) = -(T - t^*)^2$, then we need to account for the distribution of $U_T$ in our actions, i.e. always choose $i$ such that $f(i) = t^* - \mu$. This is one simple model of how we may need to account for the random component in choosing the intervention value.
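A small numeric sketch of this point, under the illustrative assumptions of a quadratic utility and additive Gaussian noise (the specific numbers are made up): an agent miscalibrated about the mean of $U_T$ picks an intervention that is systematically off by that mean.

```python
import random

T_STAR = 100          # the value of T we would ideally like to hit
TRUE_NOISE_MEAN = 20  # the actual mean of the auxiliary variable U_T

def sample_T(i, noise_mean):
    # Illustrative structural function: T = i + U_T, with Gaussian U_T.
    return i + random.gauss(noise_mean, 5)

def expected_loss(i, noise_mean, n=20_000):
    return sum((sample_T(i, noise_mean) - T_STAR) ** 2 for _ in range(n)) / n

# A calibrated agent accounts for the noise mean and aims at i = T_STAR - 20.
# A miscalibrated agent that believes U_T is zero-mean aims at i = T_STAR.
calibrated_i = T_STAR - TRUE_NOISE_MEAN
miscalibrated_i = T_STAR

print(expected_loss(calibrated_i, TRUE_NOISE_MEAN))     # ~25 (just the noise variance)
print(expected_loss(miscalibrated_i, TRUE_NOISE_MEAN))  # ~425 (variance + 20**2 squared bias)
```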

Examples

  1. The assumption that accident severities follow a thin-tailed distribution when in fact they follow a fat-tailed distribution will, in the long run, lead to inadequate preparation for the most severe accidents.

    1. In practice, human collectives often tend to inadequately prepare for natural disasters, pandemics, etc., a tendency partly explained by over-reliance on the availability heuristic.

  2. If what the agent is modeling as “merely” noisy/stochastic components of the system is actually adversarially selected, then interventions chosen to minimize loss in the face of stochastic noise will likely fail in the face of adversarial noise (i.e. noise selected/optimized so as to pessimize against the agent’s objectives).

  3. Disregarding the possibility that an apparent deviation from the usual range is due to stochasticity, and then taking action to bring it back to the usual range, might lead the agent to conclude that its action had the desired effect even when the observed return was actually due to regression to the mean.

Potential extensions & further directions

  • Extend the framework to situations of continuous and active learning.

  • Identify more real-life Goodhart problems that arise due to mistakes of structure, function, and randomness.

    • Technical AI safety/​alignment: reward-hacking, gradient hacking, goal misgeneralization.

      • How do you get the AI to (robustly) aim for the right thing and not for a bad thing?

    • AI governance: e.g. specification gaming by the AI labs.

  • Can this framework throw some light on wisdom/​coherence?

  • Build an applied rationality training paradigm/​technique to figure out where people are failing on this.

Appendices

Order-theoretic details

Goodhart’s law pertains to situations involving optimization, and optimization presumes some way of specifying which possible values of the target node $T$ are better or worse, i.e. a preference ordering over $T$. Minimally, a preference ordering needs to be a partial order relation. However, it is most convenient when we can compare any two values and tell which one is better or worse, i.e. when the preference ordering is a total preorder. Moreover, given that we are dealing with reasoning under uncertainty and want to maximize the expectation of the preference over the target, we need to be able to trade off between the lotteries induced by the different values to which we might fix the intervention variable. This essentially necessitates a utility function on the target random variable: $V : \mathcal{T} \to \mathbb{R}$. (We’re using $V$ instead of $U$ to avoid confusion with the notation for auxiliary variables.) However, if the randomness aspect is negligible, we can do with just a total preorder or, even weaker, a demand that any set of interventions/actions that the agent might encounter contains its own join (a “locally best element” is defined), that is: $\bigvee S \in S$ for every such set $S$.
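In symbols, a sketch of the conditions just described (the notation $\preceq_I$ for the induced preorder and $\mathcal{I}$ for the set of interventions the agent might encounter is ours):

```latex
% Expected-utility case: a utility function on target values induces a total
% preorder on interventions via expected utility under interventions.
\[
  V : \mathcal{T} \to \mathbb{R},
  \qquad
  i_1 \preceq_I i_2
  \iff
  \mathbb{E}\bigl[V(T) \mid I = i_1\bigr] \le \mathbb{E}\bigl[V(T) \mid I = i_2\bigr].
\]
% Weaker condition when randomness is negligible: every set of interventions the
% agent might encounter contains its own join (a locally best element).
\[
  \forall S \subseteq \mathcal{I} :\; \bigvee S \in S.
\]
```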

Relationship to Scott Garrabrant’s Goodhart Taxonomy

In the post Goodhart Taxonomy, Scott Garrabrant described four kinds of situations giving rise to Goodhart problems. They can be incorporated in our framework as follows.

Regressional Goodhart (“When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.”) occurs when the intervention affects the measure not only through the target but also through a direct connection. What Scott calls “noise” corresponds to the direct connection. (It cannot correspond to an auxiliary random variable because those are, by definition, orphans, unaffected by the values of other variables.)

Causal Goodhart (“When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.”) occurs when we fail to understand the structure, e.g. the intervention is not causally upstream from the target.

Extremal Goodhart (“Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.”) is a special case of functional error.

Adversarial Goodhart (“When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.”) is just a special case of the environment at intervention time being different from the environment at learning time.

  1. ^

    The choice and meaning of these letters will be clarified soon.

  2. ^

    We use “Goodhart’s law” for the general regularity that “Whenever a measure becomes a target, it ceases to be a good measure.” We use “goodharting” or “Goodhart’s problem(s)” for specific instances when the correlation between an intervention variable and the target collapses when optimization pressure is placed upon the former. A Goodhart failure is a Goodhart problem left unaddressed.

  3. ^

    Possible reasons include: limited time for computation/thinking, incomplete knowledge available to the agent, and the executed actions being different from the intended actions (e.g. leaking to other nodes, imperfect control over the actuators).

  4. ^

    Every random variable needs to have a random component in order to make it possible to infer a causal diagram from the data because Pearl’s paradigm can’t infer causality on variables that are related deterministically. Making causal inference possible even in a fully deterministic setting was a major motivation behind Scott Garrabrant’s Factored Space Models (formerly Finite Factored Sets).

  5. ^

    Although they don’t necessarily correspond to the order in which causal diagrams are learned in practical applications.

  6. ^

    In general, the relationship between $I$ and $T$ may not be direct but rather mediated by intermediate random variables, along with their ancestors and auxiliary random components. We’re omitting this detail here because even then it can still be abstracted away (collapsed) into a single edge with a function that determines the value of $T$ based on the value of $I$ and the values of the other ancestors of the intermediate nodes.

  7. ^

    We might also be more nuanced and modify the functional relationship between the variable and its parents or add new causal and functional dependencies. We might also be able to control the value of the node only partially, e.g. clamp not a single value but some distribution over values to the intervention node. In this post, we focus on the simplest variant involving fixing the value of a single random variable.

  8. ^

    “Other things” including real-world facts about what variables we can practically intervene on.

  9. ^

    We know this thanks to the 2024 Ig Nobel laureate in demography, Saul J. Newman: Supercentenarian and remarkable age records exhibit patterns indicative of clerical errors and pension fraud.

  10. ^

    Actually, monotonicity in all of $I$ is not necessary. The function needs only to be monotone on the subset of values of $I$ from which the agent will choose its intervention.
