In every scenario, if you have a superintelligent actor which is optimizing the grader’s evaluations while searching over a large real-world plan space, the grader gets exploited.
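To make that dynamic concrete, here is a minimal sketch (my own toy model, not from the post): the grader's evaluation of each plan is its true value plus an independent grading error, and the actor simply argmaxes the grader's output over a large plan space. All names and numbers below are illustrative assumptions.

```python
import random

random.seed(0)

N_PLANS = 100_000  # stand-in for a "large real-world plan space"

# Hypothetical toy model: each plan has a true value we care about, and the
# grader reports that value plus an independent evaluation error.
true_value = [random.gauss(0.0, 1.0) for _ in range(N_PLANS)]
grading_error = [random.gauss(0.0, 1.0) for _ in range(N_PLANS)]

def grader(plan: int) -> float:
    """The grader's evaluation of a plan: true value plus grading error."""
    return true_value[plan] + grading_error[plan]

# The actor optimizes the grader's evaluations over the whole plan space.
chosen_plan = max(range(N_PLANS), key=grader)

print(f"grader's score of chosen plan: {grader(chosen_plan):+.2f}")
print(f"true value of chosen plan:     {true_value[chosen_plan]:+.2f}")
```

With enough plans, the argmax lands disproportionately on plans whose grading error happens to be large and positive, so the grader's score overstates the chosen plan's true value; the harder the search, the larger the gap.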
Similar to the evaluator-child who’s trying to win his mom’s approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn’t a perfect grader solve the problem?
One point of this post is that specification gaming, as currently understood, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT they are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and the other hand (outer alignment) points it at our feet and pulls the trigger.
In theory, wouldn’t a perfect grader solve the problem?
Yes, in theory. In practice, I think the answer is “no”, for reasons outlined in this post.
I’m probably missing something, but doesn’t this just boil down to “misspecified goals lead to reward hacking”?
Nope! Both “misspecified goals” and “reward hacking” are orthogonal to what I’m pointing at. The design patterns I highlight are broken, IMO.
Thanks for the explanation!