In every scenario, if you have a superintelligent actor which is optimizing the grader’s evaluations while searching over a large real-world plan space, the grader gets exploited.
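To make that dynamic concrete, here is a minimal sketch (my own toy model, not from the post): the grader's evaluation of each plan is its true value plus an independent grading error, and the actor simply argmaxes the grader's output over a large plan space. All names and numbers below are illustrative assumptions.

```python
import random

random.seed(0)

N_PLANS = 100_000  # stand-in for a "large real-world plan space"

# Hypothetical toy model: each plan has a true value we care about, and the
# grader reports that value plus an independent evaluation error.
true_value = [random.gauss(0.0, 1.0) for _ in range(N_PLANS)]
grading_error = [random.gauss(0.0, 1.0) for _ in range(N_PLANS)]

def grader(plan: int) -> float:
    """The grader's evaluation of a plan: true value plus grading error."""
    return true_value[plan] + grading_error[plan]

# The actor optimizes the grader's evaluations over the whole plan space.
chosen_plan = max(range(N_PLANS), key=grader)

print(f"grader's score of chosen plan: {grader(chosen_plan):+.2f}")
print(f"true value of chosen plan:     {true_value[chosen_plan]:+.2f}")
```

With enough plans, the argmax lands disproportionately on plans whose grading error happens to be large and positive, so the grader's score overstates the chosen plan's true value; the harder the search, the larger the gap.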
Similar to the evaluator-child who’s trying to win his mom’s approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn’t a perfect grader solve the problem?
One point of this post is that specification gaming, as currently understood, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT they are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and the other hand (outer alignment) points it at our feet and pulls the trigger.
In theory, wouldn’t a perfect grader solve the problem?
Yes, in theory. In practice, I think the answer is “no”, for reasons outlined in this post.
I’m probably missing something, but doesn’t this just boil down to “misspecified goals lead to reward hacking”?
Nope! Both “misspecified goals” and “reward hacking” are orthogonal to what I’m pointing at. The design patterns I highlight are broken, IMO.
Thanks for the explanation!