I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you’re still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.
In this situation, I would look for a framing of alignment in which this unnatural adversarial problem disappears entirely. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders inside their own heads.