Planned summary for the Alignment Newsletter:

If you have an expert, but don’t trust them to give you truthful information, how can you incentivize them to tell you the truth anyway? One option is to pay them every time they provide evidence that changes your mind, with the hope that only once you believe the truth will there be no evidence left that can change your mind. This post proposes a similar scheme for AI alignment.
We train two models, M and Adv. Given a question Q, M is trained to predict the answer to Q that the human will give at the end of the procedure. Adv, on the other hand, is trained to produce arguments that do the most to make M “change its mind”, i.e. output a substantially different distribution over answers than it previously output. M then makes a new prediction. This is repeated T times; at the end, the human is shown all T arguments produced by Adv and provides their final answer (which is used to provide a gradient signal for M). After training, we throw away Adv and simply use M as our question-answering system.
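To make the training loop concrete, here is a minimal sketch in Python. The `predict`, `argue`, and `human` callables stand in for M, Adv, and the human respectively; the total-variation reward for Adv and the log-loss for M are illustrative choices that the post does not pin down.

```python
import math
from typing import Callable, List, Sequence


def train_episode(
    predict: Callable[[str, Sequence[str]], List[float]],   # M: (question, arguments so far) -> distribution over answers
    argue: Callable[[str, Sequence[str], List[float]], str],  # Adv: proposes the next argument
    human: Callable[[str, Sequence[str]], int],              # human's final answer after seeing all arguments
    question: str,
    T: int,
):
    """One episode of the scheme: T rounds of arguments, then the human's final answer."""
    transcript: List[str] = []
    predictions = [predict(question, transcript)]
    adv_rewards: List[float] = []

    for _ in range(T):
        # Adv tries to move M's current prediction as much as possible.
        argument = argue(question, transcript, predictions[-1])
        transcript.append(argument)
        new_prediction = predict(question, transcript)
        # Illustrative reward for Adv: total variation distance between consecutive predictions.
        adv_rewards.append(0.5 * sum(abs(p - q) for p, q in zip(new_prediction, predictions[-1])))
        predictions.append(new_prediction)

    # The human sees all T arguments and gives their final answer.
    final_answer = human(question, transcript)
    # Each of M's predictions targets the human's final answer (log-loss is one simple choice).
    m_loss = sum(-math.log(p[final_answer] + 1e-12) for p in predictions)
    return m_loss, adv_rewards, transcript
```

(After training, only `predict`, i.e. M, would be kept as the question-answering system.)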
One way to think about this is that M acts as a prediction market on “what will the human answer”, and Adv is trained to manipulate the market by providing new arguments that would change what the human says. So, once M’s prediction stabilizes, that result should be robust to any argument Adv could provide, and so it should be what the human would say after seeing all the arguments.
This scheme bears some resemblance to <@debate@>(@AI safety via debate@), and it can benefit from schemes that help debate, most notably <@cross-examination@>(@Writeup: Progress on AI Safety via Debate@). In particular, at every step Adv can cross-examine the previous incarnation of Adv. If the previous incarnation was deceptive, the current incarnation can demonstrate this to the human, which should cause them to disregard the previous argument. We can also add oversight, where an overseer with access to the model ensures that the model does not become non-myopic or deceptive.
Planned opinion (may change with more discussion above):
I like the simplicity of the idea “find the point at which the human no longer changes their mind”, and I like that this is a new idea for how we can scale training of AI systems beyond human-level performance. However, I’m not convinced that the training procedure given in this post would end up at this equilibrium, unless the human very specifically guided the training to do so (an assumption I don’t think we can usually make). If we were to reach the state where M stably reported the true answer to the question, then Adv would never get any reward. But Adv could do better by randomizing which arguments it makes, so that M cannot know which arguments the human will be exposed to and therefore cannot stably predict the human’s final answer. See more details in this thread.