This seems great!
If you are continuing work in this vein, I’d be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents—and I think the framework I pointed to here might be useful.
(Your second link is broken.)
Fixed—thanks!
I’m not sure I understand what you mean by “specific forms of failure.” Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?
I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell's Law, not the definition in the LW post).
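To make that concrete, here is a minimal toy in Python (my own construction for this comment, not taken from either paper): candidates are scored by a noisy proxy of an unmeasured true goal, and as selection pressure on the proxy grows, the winner's proxy score keeps climbing while its true value stalls. The heavy-tailed error term is an assumption chosen to make the divergence visible at small scale.

```python
# Toy illustration (my construction, not from either paper): optimizing a
# noisy proxy harder and harder widens the gap between the proxy score and
# the unmeasured true goal -- the divergence pattern described above.
import random
import statistics

random.seed(0)

def sample():
    u = random.gauss(0, 1)          # unmeasured true goal
    e = random.gauss(0, 1) ** 3     # heavy-tailed proxy error (an assumption)
    return u, u + e                 # (true value, proxy score the grader sees)

for n in [1, 10, 100, 1000]:        # pool size = optimization pressure
    # Pick the proxy-maximizing candidate from a pool of n, averaged
    # over 200 trials so the pattern is not noise from a single run.
    picks = [max((sample() for _ in range(n)), key=lambda c: c[1])
             for _ in range(200)]
    print(f"n={n:>4}  "
          f"proxy of pick={statistics.mean(v for _, v in picks):6.2f}  "
          f"true value of pick={statistics.mean(u for u, _ in picks):5.2f}")
```

The proxy score of the selected candidate grows roughly without bound as n increases, while the true value of that candidate plateaus near the base rate; that growing gap is what I mean by the target diverging from the true goal.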
And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both the goal-poisoning and optimization-theft cases; both of these seem relevant to the questions you discussed in terms of grader-optimization.
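A similarly rough sketch of the multi-agent version, loosely in the spirit of the goal-poisoning case (the setup, goals, and numbers here are mine, not from the paper): one agent optimizes hard over a metric reported by a second agent, and as the reporter biases the metric toward its own goal, the optimizer's pressure gets redirected away from the true goal.

```python
# Toy sketch (my construction, only loosely inspired by the "goal poisoning"
# case): agent A maximizes a metric it receives from reporter B, and B blends
# its own preference into that metric, hijacking A's optimization pressure.
import random

random.seed(1)

def true_goal(x):       # what the principal actually wants maximized (peak at x=1)
    return -(x - 1.0) ** 2

def b_goal(x):          # what the adversarial reporter B wants maximized (peak at x=3)
    return -(x - 3.0) ** 2

def reported_metric(x, bias):
    # B reports a blend of the true signal and its own preference.
    return (1 - bias) * true_goal(x) + bias * b_goal(x)

for bias in [0.0, 0.5, 0.9]:
    candidates = [random.uniform(-5, 5) for _ in range(10_000)]
    # A optimizes the reported metric hard over a large candidate pool.
    pick = max(candidates, key=lambda x: reported_metric(x, bias))
    print(f"bias={bias:.1f}  A picks x={pick:5.2f}  true value={true_goal(pick):6.2f}")
```

With no bias, A lands near the true optimum; as B's bias grows, A's choice drifts toward B's goal and the true value of the outcome falls, even though A is optimizing its metric perfectly the whole time.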