There are two related but distinct problems under the heading of Goodhart, both caused by the hard-to-deny fact that any practical metric is only a proxy/estimate/correlate of the actual true goal.
1) Anti-inductive behavior. Once a metric becomes well known as a control input, misaligned agents can spoof their behavior to take advantage of it.
2) Divergence of the legible metric from the true desire. Typically this increases over time, or over the experienced range of the metric (see the sketch below).
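To make (2) concrete, here's a minimal toy sketch of my own (nothing from the post; every function and number is invented for illustration): a proxy that tracks the true goal over the range it was validated on, then diverges badly once behavior is chosen to maximize the proxy itself.

```python
import random

random.seed(0)

# Toy world: the true goal depends on two features, but the legible metric
# only rewards visible output -- which corner-cutting also inflates.
def true_value(effort, corner_cutting):
    return effort - 2.0 * corner_cutting

def proxy_metric(effort, corner_cutting):
    return effort + corner_cutting + random.gauss(0, 0.05)  # noisy estimate

# Over the range the metric was validated on (little corner-cutting),
# proxy and true value track each other reasonably well.
experienced = [(random.uniform(0, 1), random.uniform(0, 0.2)) for _ in range(1000)]
avg_true = sum(true_value(e, c) for e, c in experienced) / len(experienced)

# Once the proxy becomes the control input, behavior is chosen to maximize it,
# pushing corner-cutting far outside the experienced range.
candidates = [(e, c) for e in (0.0, 0.5, 1.0) for c in (0.0, 0.5, 1.0)]
gamed = max(candidates, key=lambda ec: proxy_metric(*ec))

print("avg true value over the experienced range: %.2f" % avg_true)
print("true value of proxy-maximizing behavior:   %.2f" % true_value(*gamed))
```

The point isn't the particular numbers; it's that the correlation which justified the metric only ever held over the range where it was observed.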
I think you’re right that there is an underlying assumption that such a true goal exists. I don’t think there’s any reason to believe that it’s an understandable/legible function for human brains or technology. It could be polynomial or not, and it could have millions or billions of terms. In any actual human (and possibly in any embodied agent), it’s only partly specified, and even that partial specification isn’t fully accessible to introspection.
“it’s better to behave in ways that aren’t as subject to Goodhart’s Law,”
Specifics matter. It's better to behave in ways that give better outcomes, but it's not at all obvious what those ways are. Even for ways that are known to be affected by Goodhart's law, there is SOME reason to believe they're beneficial: Goodhart isn't (necessarily) a reversal of sign, only a loss of correlation with actual desires.
“it can be very hard to explain why these things are true”
Again, specifics matter. “Goodhart exists, and here’s how it might apply to your proposed metrics” has been an _easy_ discussion every time I’ve had it; I’ve literally never heard a serious objection to the concept. What’s hard is figuring out checksum metrics, or triggers to re-evaluate the control inputs. Defining the range and context inside of which a given measurement is “good enough” takes work, but is generally achievable (in the real world; perhaps not for theoretical AI work).
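For concreteness, here is the kind of thing I have in mind by “checksum metric” and “re-evaluation trigger”. It's a hypothetical sketch; the function name, thresholds, and example numbers are all made up.

```python
# Guardrail around a primary control metric: only trust the metric inside the
# range where it was validated against the true goal, and require a second
# "checksum" metric (believed to correlate with the true goal for independent
# reasons) not to move the opposite way. Otherwise, stop optimizing and re-evaluate.

VALIDATED_RANGE = (10.0, 200.0)  # range over which the proxy was checked (made up)

def should_reevaluate(primary_now, primary_before, checksum_now, checksum_before):
    out_of_range = not (VALIDATED_RANGE[0] <= primary_now <= VALIDATED_RANGE[1])
    primary_improved = primary_now > primary_before
    checksum_degraded = checksum_now < checksum_before
    # Trigger if we've left the validated range, or if the proxy says "better"
    # while the checksum says "worse" -- the classic Goodhart signature.
    return out_of_range or (primary_improved and checksum_degraded)

# Example: ticket-close rate rose, but customer retention (the checksum) fell.
print(should_reevaluate(primary_now=250, primary_before=180,
                        checksum_now=0.71, checksum_before=0.80))  # True
```

The real work, of course, is choosing the checksum and the validated range; the sketch just shows where those decisions live.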
Strongly agree, and Goodhart’s law is at least four things. Though I’d note that anti-inductive behavior / metric gaming is hard to separate from goal misspecification, for exactly the reasons outlined in the post.
But saying there is a goal too complex to be understandable and legible implies that it’s really complex, but coherent. I don’t think that’s the case for individuals, and I’m certain it isn’t true of groups. (Arrow’s theorem, etc.)
“But saying there is a goal too complex to be understandable and legible implies that it’s really complex, but coherent”
I’m not sure it’s possible to distinguish between chaotically complex and incoherent. Once you add in reference-class problems (you can’t step in the same river twice; no two decisions are exactly identical), there’s no difference between “inconsistent” and “unknown terms with large exponents on unmeasured variables”.
But in any case, even without coherence/consistency across agents or over time, any given decision can be an optimization of something.
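To spell out the (admittedly near-tautological) construction behind that claim, as a worked statement rather than anything from the thread:

```latex
% For any single observed choice a* from an option set A, define the
% degenerate utility
\[
  U(a) \;=\; \mathbf{1}[\,a = a^{*}\,],
\]
% under which the observed choice is trivially optimal:
\[
  a^{*} \in \operatorname*{arg\,max}_{a \in A} U(a).
\]
% The construction places no constraints across decisions or over time,
% which is exactly where coherence (or its absence) would have to show up.
```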
[ I should probably add an epistemic status: not sure this is a useful model, but I do suspect there are areas where it maps the territory well. ]
I’d agree with the epistemic warning ;)

I don’t think the model is useful, since it’s non-predictive. And we have good reasons to think that human brains are actually incoherent, which makes me skeptical that there’s anything useful to be found by fitting a complex model in search of a coherent description of an incoherent system.
I think (1) Dagon is right that from a purely behavioral perspective, the distinction between highly complex values and incoherence becomes meaningless at the boundaries: any set of actions can be justified via some values; (2) humans are incoherent, in the sense that there are strong candidate partial specifications of our values (most of us like food and sex) and we’re not always the most sensible in how we go about achieving them; (3) also, to the extent that humans can be said to have values, those values are highly complex.
The thing that makes these three statements consistent is that we use more than just a behavioral lens to judge “human values”.