Don’t want Goodhart? — Specify the variables more
In everyday life, if something looks good to a human, then it is probably actually good (i.e. that human would still think it’s good if they had more complete information and understanding). Obviously there are plenty of exceptions to this, but it works most of the time in day-to-day dealings. But if we start optimizing really hard to make things look good, then Goodhart’s Law kicks in. We end up with instagram food—an elaborate milkshake or salad or burger, visually arranged like a bouquet of flowers, but impractical to eat and kinda mediocre-tasting.
Why Agent Foundations? An Overly Abstract Explanation
I expect that the main problem with Goodhart’s law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term ‘good,’ which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn’t reflect the temperature, future predictions suffer.
Now I return to the burger example. Suppose a neural network operator starts optimizing certain parameters so that a burger picture increases the café’s profit. Suppose there are several parameters that can be optimized from the start: the recognizability of the burger’s image, the anticipated ‘sense of pleasure’ upon viewing, the presence of necessary ingredients, a non-irritating background, clear visibility of the image, and others. If we are solving the task of ‘increasing sales from a picture,’ we are not solving the problem of feeding the hungry; we are solving a narrower task, which means that optimizing the taste of the burger may not be needed for this task. In the same way, if we optimize for reducing the time spent on a task, we can skip the effort of fixing one of the variables.
In this example, the task was not to create the most appealing burger and at the same time maximize the taste and convenience of consumption. That would be a different function.
If you were in fact solving a narrower task (that is, only creating the picture that induces the greatest sense of pleasure, while maximizing the other listed parameters) and then looked back, puzzled as to why the hungry weren’t fed by this procedure, bringing Goodhart’s law into the discussion is madness; it stresses me out. The variable ‘people are hungry’ wasn’t important for this task at all. Oh, or was it important to you? Then why didn’t you specify it? You think it’s ‘obvious’?
The hungry people in my analogy play the same role as the variable ‘mediocrity of taste’ in the task of making a ‘pleasure-inducing picture.’ This is an extra variable for the original task. Why bring Goodhart’s law into this?
Original Goodhart’s Law: Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
There’s no word ‘GOOD’ in it at all.
I have a hypothesis for why it was brought in — due to confusion with the word ‘good.’
‘If something looks good to a person, then it is probably truly good.’
Here, I interpret the umbrella term ‘good’ as human intuition that the burger will satisfy them on all essential parameters. But ‘looks good’ narrows our view to the variable ‘appearance.’ While ‘truly good’ I decipher as ‘I am satisfied with most important variables for my task, not just the variable “pleasant appearance.”’
My replacement now looks like this: If a person signals a high ‘looks good’ parameter, then it is likely that they will be satisfied with other parameters of the item if they learn their values.
By making such a translation, the statement becomes a testable hypothesis, and I think the claim that ‘in most cases, it holds true in everyday life’ now crumbles as a reliable predictor. All I did was taboo the word ‘good.’ It will NOT always hold true, especially in cases where the optimization of appearance hides shortcomings in other parameters.
I expect the author would not have arrived at their original thesis if they had tabooed the word ‘good’ and replaced it with the variables they meant.
I expect that most people who wanted a real diamond that ‘looks nice’ and later found out it was fake would change their view of ‘good’ in most cases, not in the minority.
I remind you that the original Goodhart’s Law is about the collapse of a statistical regularity once it ceases to be coupled with reality.
If an employee receives a reward for the number of cars sold each month, they will try to sell more cars even at a loss.
This scenario would not have occurred if the worker had maximized not only the variable ‘number of cars’ but also the variable ‘profit.’ This variable could have been included from the start. A mandatory profit condition would have made it harder to ‘Goodhart’ the number of cars.
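To make the car example concrete, here is a minimal sketch. The demand curve, the unit cost, and the price range are all made-up assumptions, chosen only to show how the chosen price changes once ‘do not sell at a loss’ is written into the specification.

```python
# Illustrative sketch; every number and function here is a made-up assumption.
UNIT_COST = 20_000  # hypothetical cost of one car to the dealership


def cars_sold(price: int) -> int:
    """Hypothetical monthly demand: the lower the price, the more cars sell."""
    return max(0, 100 - price // 500)


def profit(price: int) -> int:
    """Monthly profit at a given price (negative means selling at a loss)."""
    return cars_sold(price) * (price - UNIT_COST)


candidate_prices = range(10_000, 40_001, 1_000)

# Specification 1: the employee is rewarded only for the number of cars sold.
price_count_only = max(candidate_prices, key=cars_sold)

# Specification 2: same reward, but prices that lose money are not allowed.
no_loss_prices = [p for p in candidate_prices if profit(p) >= 0]
price_with_profit = max(no_loss_prices, key=cars_sold)

for label, price in [("count only", price_count_only),
                     ("count + no loss", price_with_profit)]:
    print(f"{label}: price={price}, cars={cars_sold(price)}, profit={profit(price)}")
```

Under these assumed numbers, the count-only rule picks the lowest available price and loses money on every car, while the second rule stops at the break-even price.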
There is no reason to be surprised if, in optimizing the task ‘pleasure-inducing burger picture,’ you did not include the variables ‘physical pleasantness of the burger’s taste’ and ‘convenience of eating the physical burger.’ But if you did include them, I expect the problem would disappear, because now they too start being optimized.
To solve Goodhart’s law in such scenarios, it’s enough to add the variables that you mentally put under the umbrella term ‘good’ but forgot to include in the original optimization formula. Otherwise you end up surprised that a variable you expected under ‘good’ wasn’t optimized, when the reason is simply that you didn’t include it!
How to decide in advance which variables to add? — Spend cognitive resources (or use others’) to model what kind of horrifying stress awaits you in the future if the goal is met differently than you imagined and identify which variable changes would cancel it.
If a car-selling employer had modeled in advance that an employee would start optimizing the number of cars for salary, they would have added a new variable — profit. One reason they didn’t could be that they didn’t brainstorm this failure mode — then the answer is: brainstorm failure modes.
Suppose you maximized politeness in GPT-4 during its design but noticed some ‘Goodhart’: GPT maximizes politeness in form, yet you detect passive aggression or veiled insults. That is your responsibility, for cutting corners and hiding several implicit expectations about other variables behind the umbrella word ‘politeness,’ which GPT doesn’t know. Think of those variables in advance and specify them better, since you’re such a reductionist afraid of Goodhart. This is a solvable problem: adding more variables changes the outcome, so add more. No wonder you run into ‘Goodhart’ if you make requests like ‘do well.’
If someone comes to a pharmacy and says ‘give me a good medicine,’ it can be stated after the fact that they will only be satisfied if the medicine matches hidden variables 1, 5, 6, and 9. These four variables were packed into the word ‘good,’ and the seller must guess them from the context. But here’s the issue: the seller guessed 1 and 5, didn’t guess 6 and 9, and assumed 2 and 4 instead. Are the universes different? Yes. Are the consequences different? Yes. To avoid this, variables are usually clarified.
If the buyer assumed that ‘good’ = 1, 5, 6, and 9 is COMMON KNOWLEDGE, then they were WRONG.
- You, the seller, Goodharted 1 and 5, but what about 6 and 9?
- Maybe you should have made 6 and 9 COMMON KNOWLEDGE?
- Well, it’s obvious that ‘good’ includes 6 and 9.
- THIS IS (among other reasons) WHY ALIGNMENT IS UNSOLVED!
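The pharmacy mismatch can be written out as plain set arithmetic. This is only a sketch: the variable numbers are the placeholder labels from the example above, and the names `buyer_meant` and `seller_guessed` are made up for illustration.

```python
# Placeholder variable labels taken from the pharmacy example.
buyer_meant = {1, 5, 6, 9}     # what the buyer packed into the word "good"
seller_guessed = {1, 5, 2, 4}  # what the seller inferred from context

matched = buyer_meant & seller_guessed   # variables both sides share
missed = buyer_meant - seller_guessed    # expected by the buyer, never optimized
unwanted = seller_guessed - buyer_meant  # optimized, but nobody asked for them

print("matched:", sorted(matched))
print("missed:", sorted(missed))
print("unwanted:", sorted(unwanted))
```

The `missed` set is exactly what has to be made common knowledge before the request is handed over.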
I expect that many similar problems will be solved by removing the word ‘good’ altogether and replacing it with variables — and if you can’t replace it with variables now, then expect problems of this kind.
Make 6 and 9 common knowledge! An LLM won’t PARSE your 6 and 9!
Are you too lazy to break it down into variables, wanting to save effort and just write ‘good’? Then accept your ‘Goodhart.’
I guess in real life, the reason (for leaving out the important variables) is a combination of:
- ignorance
- technical difficulties with measuring something
- people not caring deeply (just doing their job in the easiest possible way)
- legal reasons (it is not allowed to measure something)
To use the example of software developers being evaluated on how many lines of code they write:
- the manager often has zero programming skills and couldn’t tell good code from bad even if they tried, so they measure something they can understand
- what is “good code,” actually? Ask five programmers on your team and you will get five different opinions
- the manager doesn’t really care about the quality of the code and just tries to make their boss happy by producing some kind of report
- it would be really bad if it turned out that your diversity hire actually sucks at coding, so this way you at least give them a chance to get good results on paper
Humans have been trying to parse out the hidden variables in the word “good” for millennia, plausibly since before the dawn of writing. We’ve made progress, a lot of it, but that doesn’t stop the remaining problem from being very hard, because yes, the word is something like an abstraction of an approximation of a loose cloud of ideas. We can parse out and taboo variables 1-9, and come to agree on them, and still be caught off guard by variable 184 when it interacts with variables 125 and 4437, each of which is irrelevant (or an unchanging background) to most people in most contexts, to the point that we’ve never bothered to consciously notice them.
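For a rough sense of scale, here is a small sketch that counts how many two- and three-variable interactions there are to check. The number 4437 is simply the largest index mentioned above, used as an assumed variable count, not a real estimate of how many variables “good” hides.

```python
from math import comb

# Assumed variable count, borrowed from the largest index mentioned above.
n_variables = 4437

pairs = comb(n_variables, 2)    # possible two-variable interactions
triples = comb(n_variables, 3)  # possible three-variable interactions

print(f"{n_variables} variables -> {pairs:,} pairs, {triples:,} triples")
```

Even at this assumed size, the triples alone run past ten billion, which is one way to see why brainstorming failure modes in advance cannot be exhaustive.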
You talked about thermometers. You’re right, but consider that it took centuries to go from “We want to measure how hot or cold things are” to “here’s the actual physical definition of temperature.” Even now it’s unintuitive, there’s no single instrument that works to measure it in all cases, and it doesn’t quite align with what every user of the word wants out of the concept. “Good” is a lot more complicated than “temperature.”
In other words, yes, of course, let’s keep doing this, more and better! But let’s be honest about it being actually hard and complicated.