Am I correct in saying that this suggests avoiding Goodhart’s law by using pass/fail grading? Or at least, by putting a maximum on artificial rewards, such that optimizing for the reward is senseless beyond that point?
Let’s take a common case of Goodhart’s law: teachers who are paid based on their students’ test scores. Imagine that teachers are either good or bad, and can either teach to the test (strategize) or not. Both true and measured performance are better on average for good teachers than for bad, but have some random variance. Meanwhile, true performance is better when teachers don’t strategize, but measured performance is better when they do.
If good teachers care to some degree about true performance, and you set an appropriate cutoff and payouts, the “quantilized” equilibrium will be that good teachers don’t strategize (since they’re relatively confident that they can pass the threshold without it), but bad teachers do (to maximize their chances of passing the threshold). Meanwhile, good teachers still get higher average payouts than bad teachers. This is probably better than the Goodhart case where you manage to pay good teachers a bigger bonus relative to bad teachers, but all teachers strategize to maximize their payout. So this formalization seems to make sense in this simple test case.
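To sanity-check this, here's a quick Monte Carlo sketch of the setup. All the numbers (the type means, the δ penalty/boost, the noise, the cutoff, the bonus, and how much each type cares about true performance) are illustrative choices of mine, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Illustrative parameters (my own choices):
MU = {"good": 1.0, "bad": 0.0}    # mean true performance by teacher type
DELTA = 0.25                      # strategizing: true perf -DELTA, measured +DELTA
SIGMA = 0.5                       # noise in performance
THRESHOLD = 0.5                   # pass/fail cutoff on the measured score
BONUS = 1.0                       # payout for clearing the cutoff
CARE = {"good": 0.5, "bad": 0.0}  # weight each type puts on true performance

for kind in ("good", "bad"):
    for s in (0, 1):  # s = 1 means "teach to the test"
        true = MU[kind] - DELTA * s + rng.normal(0.0, SIGMA, N)
        measured = true + 2 * DELTA * s  # measured exceeds true by 2*DELTA when strategizing
        pass_rate = (measured >= THRESHOLD).mean()
        payoff = BONUS * pass_rate + CARE[kind] * true.mean()
        print(f"{kind:>4} strategize={s}: pass_rate={pass_rate:.3f} payoff={payoff:.3f}")
```

With these numbers, the good teacher is already likely to pass, so the δ hit to true performance (which she cares about) outweighs the small gain in pass probability, and she does best not strategizing; the bad teacher roughly doubles his pass probability by strategizing, so he does. The good teacher's expected payout also stays higher, matching the equilibrium described above.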
ETA: I was trying to succinctly formalize the example above, and I got as far as $U \sim \mathcal{N}(\mu(\text{teacher}) - \delta \cdot \text{strategy},\ \sigma^2)$ with $I = -2\delta \cdot \text{strategy}$. But that takes $I$ as the difference between the test score and the true utility, without separating test scores from payouts, and I didn't want to write out all the complications that result, so I quit. I hope the words are enough to convey what I meant.
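For what it's worth, here is one way to spell out the separation the ETA gestures at, keeping its sign convention; writing $M$ for the test score and $s \in \{0, 1\}$ for whether the teacher strategizes is my own notation, so take it as a guess at the intended setup:

$$U \sim \mathcal{N}\big(\mu(\text{teacher}) - \delta s,\ \sigma^2\big), \qquad M = U + 2\delta s, \qquad I = U - M = -2\delta s.$$

The payout would then be some function of $M$ alone (e.g. a pass/fail bonus at a cutoff), which is the part that introduces the complications mentioned above.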
At least typically, we're talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom; how should you pick one? A: Randomly sample from the teachers above some performance threshold in some base distribution. This works best given some fixed, finite amount of "counterfeit performance" in that distribution.
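Concretely, the selection rule in that answer might look like the following sketch (the base distribution and the cutoff are stand-ins of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_teacher(measured_scores, threshold):
    """Sample uniformly at random among teachers whose measured
    performance clears the threshold, rather than taking the argmax
    (which would select hardest for counterfeit performance)."""
    eligible = np.flatnonzero(measured_scores >= threshold)
    return rng.choice(eligible)

# Toy base distribution of 100 candidates; cut at the 80th percentile.
scores = rng.normal(0.0, 1.0, size=100)
chosen = pick_teacher(scores, threshold=np.quantile(scores, 0.8))
print(chosen, scores[chosen])
```

Randomizing over everyone above the cutoff, rather than maximizing the score, is what makes a fixed, finite budget of counterfeit performance in the base distribution only able to distort the selection by a bounded amount.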
If we treat the teachers as a bunch of agents, we don't yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance ($I$) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
I don’t fully understand the rest of the comment.
Fair enough. You were thinking about the problem from the point of view of hiring a teacher; when projecting it onto the problem from the point of view of a teacher deciding how to teach, I had to make additional assumptions not in the original post (i.e., that "teachers care about true performance to some degree").
Still, I think that putting it in concrete terms like this helped me understand (and agree with) the basic idea.