TurnTrout comments on Don’t design agents which exploit adversarial inputs

TurnTrout 26 Nov 2022 4:08 UTC
LW: 2 AF: 2
One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT they are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). With one hand (inner alignment) we load the shotgun, and with the other (outer alignment) we point it at our own feet and pull the trigger.
In theory, wouldn’t a perfect grader solve the problem?
Yes, in theory. In practice, I think the answer is “no”, for reasons outlined in this post.
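To make the "in practice" part concrete, here is a toy sketch of the pattern described above: an agent that argmaxes an imperfect grader over every plan it can consider. Everything in it is an illustrative assumption rather than anything from the post: the integer "plans", the `true_value` and `grader` functions, and the one-in-a-thousand rate of plans that fool the grader are all made up for the example.

```python
def is_adversarial(plan: int) -> bool:
    # Hypothetical assumption: one plan in a thousand happens to fool the grader.
    return plan % 1000 == 999


def true_value(plan: int) -> float:
    # What we actually care about; adversarial plans are worthless.
    return 0.0 if is_adversarial(plan) else (plan % 10) / 10


def grader(plan: int) -> float:
    # Imperfect evaluation: accurate on ordinary plans,
    # wildly overconfident on the rare adversarial ones.
    return 100.0 if is_adversarial(plan) else true_value(plan)


def argmax_plan(num_plans: int) -> int:
    # The agent considers plans 0..num_plans-1 and picks the highest-graded one.
    return max(range(num_plans), key=grader)


for n in (100, 10_000):
    best = argmax_plan(n)
    print(f"plans considered={n}, chosen={best}, "
          f"grade={grader(best)}, true value={true_value(best)}")
```

With only 100 plans in view, the search never meets a plan that fools the grader, so the chosen plan is genuinely good. With 10,000 plans in view, the argmax lands on a plan the grader mis-scores, and the true value drops to zero. In this toy setup, a perfect grader (one where `grader` equals `true_value` everywhere) would remove the problem, but any nonzero error rate gets found once the search is wide enough.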
- zeshen 26 Nov 2022 10:58 UTC
  1 point
  Thanks for the explanation!