A few thoughts on using the Goodhart Taxonomy
I wrote this post in part with the hope that others would realise you can take the abstraction Scott offered and apply it to concrete domains of interest. I’d be excited to read posts using this on other domains, like Goodhart Taxonomy: Bayesianism or Goodhart Taxonomy: Management, to make up two examples. Anyway, it’s also good to report what it was like on the concrete level, so here are some parts I found hard.
On Power
I have no idea which type of goodharting power really falls under. The obviously adversarial case is where having power causes folks to deceive you, but there’s a less explicit type where people aren’t explicitly hostile, and yet you observe that increasing power causes them to submit to you more in arguments. The world where others are explicitly optimising on you feels adversarial, but the world where you notice that acting higher status leads to people agreeing with you more feels regressional, because you’re grabbing power not for adversarial reasons, just because you found a noise variable. But I’m really not sure.
I originally picked adversarial, then causal, then regressional, and finally back to adversarial. Does someone want to argue to me that it’s extremal?
On Regressional vs Extremal
(Note that I wrote this when power was still in the regressional category. I’ll leave it here in case it’s useful to anyone.)
It can seem at first like the difference between regressional goodharting and extremal goodharting is merely quantitative, not qualitative. Someone trusting you a lot is plausibly part of the noise under regressional goodhart, and becomes extremal goodhart only when it overwhelms everything else.
I think the key distinction here is that regressional goodhart operates over the normal range of experience, while extremal goodhart only becomes noticeable outside of that range.
I feel like I can map the examples between the two types without difficulty. In this list, Regressional is marked ‘R’ and Extremal is marked ‘E’.
R: Shared background models → E: An identical copy of you
R: Power → E: A dictator of a small country
It seems clear to me that, if you said “I’m going to think about who agrees with me the most on the important issues”, it would be useful to reply “Take care not to regressional goodhart on that. Don’t over-weight agreement with people who already have a lot of similar cognitive tools to you, or people you have power over”, while saying “Be careful not to accidentally become a dictator of a small country or create an identical copy of yourself” would be unhelpful. Yet they are the same type of error, just with a whopping quantitative distinction between them.
I admit that I’m claiming a distinction based on subjective qualities (i.e. the fact that we are humans, not any other facts about math), but that doesn’t seem like reason enough to do away with the distinction in the language.
On Causal vs Regressional
I feel that many of the examples in causal could move back to regressional, but I’m not sure why that is. Trying to say things more like ‘I agree with you’, or discounting processes that take longer to reach agreement as ‘bad processes’ when really those questions were just deeper and more nuanced, both feel like optimising for noise variables. /shrug
Conclusion
Overall, I didn’t come into the post knowing what argument I was going to make, just with the intent to see what each model implied. Extremal ended up being the most surprising (and the least useful, on average, though ‘yes-men’ was a good thing to notice). Causal ended up being the most useful: for ages it was just a long list of examples, then I noticed that they split into three groups. I’m pretty happy with the result; it clarified my thinking on the topic a bunch.
I understood Regressional vs Extremal like this:
A signal is a victim of regressional Goodhart if it’s a good indicator of the thing you care about but stops working the moment you start optimizing for it. For example, an empty email inbox is a decent signal that I’ve taken care of the things I need to do, but not if I do the obvious optimization and set up a filter to delete all incoming email...
A signal is a victim of extremal Goodhart if optimizing for it works well until you get to very high values, at which point it suddenly stops working. For example, if I increased the fraction of my time I spend exercising, I expect I would keep getting healthier for a while. But if I were exercising so much that I needed to take stimulants to move time from sleep to exercise, I expect I’d be hurting myself.
So I think those are mutually exclusive (optimizing a signal can’t both fail immediately and keep working for a while).
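To make that concrete, here is a minimal Python sketch in the spirit of the two failure modes above, using the ‘proxy = goal + noise’ picture for regressional goodhart and a relationship that reverses at the extreme for extremal goodhart. The health function, the thresholds, and all the numbers are made up purely for illustration:

```python
import random

random.seed(0)

# Toy model of the exercise example above: "health" as a function of hours of
# exercise per week. The shape is made up -- more helps up to a point, then
# trading sleep for exercise starts to hurt.
def health(exercise_hours):
    if exercise_hours <= 15:
        return exercise_hours                 # normal range: more is better
    return 15 - 2 * (exercise_hours - 15)     # extreme range: more is worse

# Extremal Goodhart: pushing the proxy works for a while, then the
# proxy/goal relationship breaks down outside the normal range.
for hours in [5, 10, 15, 25, 40]:
    print(f"{hours:>2}h/week exercise -> health {health(hours)}")

# Regressional Goodhart: a proxy that is goal-plus-noise overstates the goal as
# soon as you select hard on it, because the top scorer is largely the luckiest
# noise draw.
candidates = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]  # (goal, noise)
winner = max(candidates, key=lambda c: c[0] + c[1])  # optimize the proxy = goal + noise
goal, noise = winner
print(f"winner: proxy score {goal + noise:.2f}, actual goal value {goal:.2f}")
```

In the regressional half the winner’s proxy score overstates its true value the moment you select hard on it; in the extremal half the proxy keeps tracking the goal across the normal range and only comes apart far outside it, which is the distinction being drawn above.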