The author seems to be skipping a step in their argument. I thought Goodhart’s Law was about how it’s hard to specify a measurable target which exactly matches your true goal, not that true goals don’t exist.
For example, if I wanted to donate to a COVID-19 charity, I might pick one with the measurable goal of reducing the official case numbers… and they could spend all of their money bribing people not to report cases or making testing harder. Or if they’re an AI, they could hit this goal perfectly by killing all humans. But just because this goal (and probably every easily measurable goal) is Goodhartable doesn’t mean all possible goals are. The thing I actually want is still well defined (I want actual COVID-19 cases to decrease, and I want the method to pass a filter defined by my brain); it’s just that the real, fundamental thing I want is impossible to measure.
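To make that concrete, here’s a minimal toy sketch (the interventions and numbers are made up for illustration, not taken from the post): an optimizer that scores options by the measurable proxy “reported cases” picks the reporting-suppression option, while scoring by what I actually care about picks the genuine intervention.

```python
# Toy illustration of the proxy-vs-true-goal gap described above.
# All interventions and numbers are hypothetical.

interventions = {
    # name: (actual_cases_after, reported_cases_after)
    "fund vaccination drives": (1_000, 900),      # really reduces cases
    "bribe people not to report": (10_000, 100),  # only reduces the measurement
    "do nothing": (10_000, 8_000),
}

def best_by(metric):
    """Return the intervention that minimizes the given metric."""
    return min(interventions, key=lambda name: metric(*interventions[name]))

proxy = lambda actual, reported: reported    # the measurable target
true_goal = lambda actual, reported: actual  # what I actually want

print(best_by(proxy))      # -> "bribe people not to report" (Goodharted)
print(best_by(true_goal))  # -> "fund vaccination drives"
```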
The author seems to be skipping a step in their argument. I thought Goodhart’s Law was about how it’s hard to specify a measurable target which exactly matches your true goal, not that true goals don’t exist.
But this is the point. That’s why it was titled “Recursive Goodhart’s Law”—the idea being that at any point where you explicitly point to a “true goal” and a “proxy”, you’ve probably actually written down two different proxies of differing quality. So you can keep trying to write down ever-more-faithful proxies, or you can “admit defeat” and attempt to make do without an explicitly written-down function.
And the author explicitly admits that they don’t have a good way to convince people of this, so, yeah, they’re missing a step in their argument. They’re more saying some things that are true and less trying to convince.
As for whether it’s true—yeah, this is basically the whole value specification problem in AI alignment.
I agree that Goodhart isn’t just about “proxies”; it’s more specifically about “measurable proxies”, and the post isn’t really engaging with that aspect. But I think that’s fine. There’s also a Goodhart problem wrt proxies more generally.
I talked about this in terms of “underspecified goals”—often, the true goal doesn’t exist in any clear form, and may not be coherent. Until that’s fixed, the problem isn’t really Goodhart, it’s just sucking at deciding what you want.
I’m thinking of a young kid in a candy store who has $1, and wants everything, and can’t get it. What metric for choosing what to purchase will make them happy? Answer: There isn’t one. What they want is too unclear for them to be happy. So I can tell you in advance that they’re going to have a tantrum later about wanting to have done something else no matter what happens now. That’s not because they picked the wrong goal, it’s because their desires aren’t coherent.
But “COVID-19 cases decreasing” is probably not your ultimate goal: more likely, it’s an instrumental goal for something like “prevent humans from dying” or “help society” or whatever… in other words, it’s a proxy for some other value. And if you walk back the chain of goals enough, you are likely to arrive at something that isn’t well defined anymore.