I’m not sure I understand what you mean by “specific forms of failure.” Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?
I think what you call grader-optimization is, at its core, about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially in how we defined Campbell’s Law, not the definition in the LW post).
And the second paper’s taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart’s law, in both the goal-poisoning and optimization-theft cases; both of these seem relevant to the questions you discussed in terms of grader-optimization.