I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you’re still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.
In this situation, I would look for a framing of alignment in which this unnatural adversarial problem disappears entirely. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders inside their own heads.