The author seems to be skipping a step in their argument. I thought Goodhart’s Law was about how it’s hard to specify a measurable target which exactly matches your true goal, not that true goals don’t exist.
But this is the point. That’s why it was titled “Recursive Goodhart’s Law”—the idea being that at any point where you explicitly point to a “true goal” and a “proxy”, you’ve probably actually written down two different proxies of differing quality. So you can keep trying to write down ever-more-faithful proxies, or you can “admit defeat” and attempt to make do without an explicitly written-down function.
And the author explicitly admits that they don’t have a good way to convince people of this, so, yeah, they’re missing a step in their argument. They’re more saying some things that are true and less trying to convince.
As for whether it’s true—yeah, this is basically the whole value specification problem in AI alignment.
I agree that Goodhart isn’t just about “proxies”; it’s more specifically about “measurable proxies”, and the post isn’t really engaging with that aspect. But I think that’s fine. There’s also a Goodhart problem wrt proxies more generally.
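To make the proxy-vs-proxy point concrete, here’s a toy sketch in Python (the objectives are my own invention, not anything from the post): an agent that maximizes a proxy score does fine while the proxy tracks a “truer” objective, and badly once it stops tracking it. And of course `truer_objective` here is itself just something I wrote down, which is exactly the regress the post is gesturing at.

```python
# Toy Goodhart sketch (hypothetical setup): the proxy rewards effort without
# limit, while the "truer" objective has diminishing and then negative returns.

def truer_objective(effort):
    # Peaks at effort = 5, then declines.
    return effort * (10 - effort)

def proxy(effort):
    # Only sees raw effort, so it keeps rewarding more of it.
    return effort * 10

best_by_proxy = max(range(11), key=proxy)
best_by_truer = max(range(11), key=truer_objective)

print(best_by_proxy, truer_objective(best_by_proxy))   # 10 -> truer value 0
print(best_by_truer, truer_objective(best_by_truer))   # 5  -> truer value 25
```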
I talked about this in terms of “underspecified goals”—often the true goal doesn’t exist clearly, and may not even be coherent. Until that’s fixed, the problem isn’t really Goodhart, it’s just sucking at deciding what you want.
I’m thinking of a young kid in a candy store who has $1, and wants everything, and can’t get it. What metric for choosing what to purchase will make them happy? Answer: There isn’t one. What they want is too unclear for them to be happy. So I can tell you in advance that they’re going to have a tantrum later about wanting to have done something else no matter what happens now. That’s not because they picked the wrong goal, it’s because their desires aren’t coherent.
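If it helps, here’s one way to cash out “their desires aren’t coherent” (a made-up formalization on my part, not anything the kid literally has in their head): treat the kid’s pairwise preferences as cyclic. Then no scoring metric can respect all of them at once, so there’s no metric to Goodhart in the first place; the failure is upstream of picking a proxy.

```python
from itertools import permutations

# Hypothetical cyclic preferences: chocolate over gummies, gummies over
# lollipop, lollipop over chocolate. Any numeric metric induces a strict
# ranking, so it's enough to check every possible strict ranking.

prefers = {("chocolate", "gummies"), ("gummies", "lollipop"), ("lollipop", "chocolate")}
items = ["chocolate", "gummies", "lollipop"]

def some_metric_works(items, prefers):
    for ranking in permutations(items):
        rank = {item: i for i, item in enumerate(ranking)}  # lower index = preferred
        if all(rank[a] < rank[b] for a, b in prefers):
            return True
    return False

print(some_metric_works(items, prefers))  # False: no ranking satisfies a cycle
```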