Could part of the problem be that the actor is optimizing against a single grader’s evaluations? Shouldn’t it somehow take uncertainty into account?
Consider an ensemble of graders, each trained (or still learning) to evaluate plans/actions from a different initialization and/or using different input information. Each grader would have a different perspective, but the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).
Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier between maximum E_ensemble[evaluation] and minimum Var_ensemble[evaluation].
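To make that concrete, here is a minimal sketch (the grader interface, function names, and the trade-off weight `lam` are illustrative assumptions, not anything specified above) of scoring candidate plans against an ensemble and then picking a point from the mean/variance Pareto frontier:

```python
import numpy as np

# `graders` is assumed to be a list of callables mapping a plan to a scalar evaluation.

def ensemble_stats(plans, graders):
    """Mean and variance of the graders' evaluations for each candidate plan."""
    evals = np.array([[g(p) for g in graders] for p in plans])  # shape (n_plans, n_graders)
    return evals.mean(axis=1), evals.var(axis=1)

def pareto_front(means, variances):
    """Indices of plans not dominated on (higher mean, lower variance)."""
    front = []
    for i in range(len(means)):
        dominated = any(
            (means[j] >= means[i] and variances[j] <= variances[i])
            and (means[j] > means[i] or variances[j] < variances[i])
            for j in range(len(means)) if j != i
        )
        if not dominated:
            front.append(i)
    return front

def pick_plan(plans, graders, lam=1.0):
    """Choose a plan 'in the middle' of the frontier via a mean - lam * variance scalarization."""
    means, variances = ensemble_stats(plans, graders)
    front = pareto_front(means, variances)
    return max(front, key=lambda i: means[i] - lam * variances[i])
```

The scalarization at the end is just one way to pick something in the middle of the frontier; any other selection rule over the non-dominated set would express the same idea.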
My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you’re still building a system which runs computations at cross-purposes, where one part of the system (the actor) is trying to trick another part, and that other part (each individual grader) is trying not to be tricked.
In this situation, I would look for a framing of alignment under which this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.
This relates closely to how to “solve” Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.