I’m not sure I understand what you mean by “specific forms of failure.” Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?
I think what you call grader-optimization is, at its core, about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially in how we defined Campbell’s Law, not the definition in the LW post).
And the second paper’s taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart’s law, in both the goal-poisoning and optimization-theft cases; both of these seem relevant to the questions you discussed in terms of grader-optimization.