Thomas Kwa comments on When is Goodhart catastrophic?

Thomas Kwa 3 Jun 2024 9:04 UTC
LW: 6 AF: 4
We considered that “catastrophic” might have that connotation, but we couldn’t think of a better name and I still feel okay about it. Our intention with “catastrophic” was to echo the standard ML term “catastrophic forgetting”, not to suggest a global catastrophe. In catastrophic forgetting, the model completely forgets how to do task A after it is trained on task B; it doesn’t do A much worse than random. So we think that “catastrophic Goodhart” gives the correct idea to people who come from ML.
The natural question is then: why didn’t we study circumstances in which optimizing for a proxy gives you $- \infty$ utility in the limit? Because that doesn’t happen under the assumptions we are making. We wanted to study regressional Goodhart, and this naturally led us to the independence assumption. Previous work like Zhuang et al. and Skalse et al. has already formalized the extremal Goodhart / “use the atoms for something else” argument that optimizing for one goal would be bad for another goal, and we thought the more interesting part was showing that bad outcomes are possible even when error and utility are independent. Under the independence assumption, it isn’t possible to get less than 0 utility.
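To make the independence claim concrete, here is a minimal Monte Carlo sketch of the frame proxy = error + utility. It is only an illustration under assumptions that are mine, not taken from the paper: utility is standard normal (so the prior mean is 0), the light-tailed error is standard normal, the heavy-tailed error is a symmetric Pareto with tail index 1.5, and "optimizing the proxy" is modeled as conditioning on the proxy exceeding a threshold. All function names are hypothetical.

```python
import random


def mean_utility_given_high_proxy(sample_error, threshold, n=200_000, seed=0):
    """Monte Carlo estimate of E[utility | error + utility >= threshold].

    Utility is drawn from N(0, 1), independently of the error, so the
    prior (unoptimized) mean utility is 0 in this sketch.
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(n):
        utility = rng.gauss(0.0, 1.0)
        error = sample_error(rng)
        # Keep only outcomes whose proxy value clears the threshold.
        if error + utility >= threshold:
            selected.append(utility)
    if not selected:
        raise ValueError("no samples cleared the threshold; raise n or lower it")
    return sum(selected) / len(selected)


def light_tailed_error(rng):
    # Assumed light-tailed error: standard normal.
    return rng.gauss(0.0, 1.0)


def heavy_tailed_error(rng):
    # Assumed heavy-tailed error: Pareto magnitude (tail index 1.5), random sign.
    magnitude = rng.paretovariate(1.5)
    return magnitude if rng.random() < 0.5 else -magnitude


if __name__ == "__main__":
    print("light-tailed error, threshold 3:",
          mean_utility_given_high_proxy(light_tailed_error, 3.0))
    print("heavy-tailed error, threshold 50:",
          mean_utility_given_high_proxy(heavy_tailed_error, 50.0))
```

With light-tailed error, selecting on a high proxy should also select for high utility. With heavy-tailed error, a high proxy is mostly explained by a large error, so the conditional mean utility should sit near the prior mean of 0 rather than far below it, up to sampling noise.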
To get $- \infty$ utility in the frame where proxy = error + utility, you would need to assume something about the dependence between error and utility, and we couldn’t think of a simple assumption to make that didn’t have too many moving parts. I think extremal Goodhart is overall more important, but it’s not what we were trying to model.
Lastly, I think you’re imagining the “average” outcome as a random policy, which is an agent incapable of doing significant harm. The utility of the universe is still positive because you can go about your life. But in a different frame, random is really bad. Right now we pretrain models and then apply RLHF (and hopefully soon, better alignment techniques). If our alignment techniques produce no more utility than the prior, this means the model is no more aligned than the base model, which is a bad outcome for OpenAI. Superintelligent models might be arbitrarily capable of doing things, so the prior might be better thought of as irreversibly putting the world in a random state, which is a global catastrophe.