don’t understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is:
Your AI samples two answers to a question.
The human evaluates which one of them is better. The AI’s objective is to sample answers that are most likely to be marked as “better.”
In order to evaluate which answer is better, the human asks the AI subquestions.
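To make this concrete, here is a minimal sketch of one training step. It assumes PyTorch-style autograd and hypothetical interfaces (`model.sample_answer`, `model.log_prob`, `human_judge`) that I'm introducing purely for illustration; none of them are defined in the original discussion:

```python
def amplification_rl_step(model, question, human_judge, optimizer):
    """One RL step for amplification (sketch with hypothetical interfaces)."""
    # 1. The AI samples two answers to the question.
    answer_a = model.sample_answer(question)
    answer_b = model.sample_answer(question)

    # 2. The human evaluates which answer is better; to do so, the human
    #    may ask the same AI subquestions.
    def ask(subquestion):
        return model.sample_answer(subquestion)

    preferred = human_judge(question, answer_a, answer_b, ask)

    # 3. The AI's objective is to produce answers likely to be marked
    #    "better". One simple choice: increase the log-probability of the
    #    preferred answer (REINFORCE with reward 1 for the winner and 0
    #    for the loser).
    loss = -model.log_prob(question, preferred)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return preferred
```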
This is very similar to debate. The most salient difference is that in the case of amplification, the subanswers are recursively evaluated in the same way as the original answer (i.e. the AI is trying to optimize the probability that its answer would be picked as the better answer, if that subquestion had been chosen as the top-level question). In debate, we have two AIs competing, and each subanswer is generated in order to support one of the original answers / to produce a coherent narrative in combination with one of the original answers.
(There are a bunch of other incidental differences, e.g. whether the process is driven by the judge or by the debaters, but this doesn’t really matter given that you can ask questions like “What subquestion should I ask next?”)
The main advantage of debate, as I see it, is as a mechanism for choosing which subquestions to train on. That is, if there is an error buried somewhere deep in the amplification tree, it may never be visited by the amplification training process. But a strategic debater could potentially steer the tree towards that error, if they treat the entire debate as an RL process. (This was my main argument in favor of debates in 2015.)
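The steering point can be illustrated with a toy calculation. Suppose the tree has branching factor 4 and the error sits 6 levels deep: expansion that ignores the flaw visits it with probability (1/4)^6, roughly 0.02%, while a debater who suspects where the flaw is can walk straight to it. The numbers and names below are made up for illustration only:

```python
import random

DEPTH, BRANCHING = 6, 4
FLAWED_PATH = [2, 0, 3, 1, 2, 0]  # the sequence of children hiding the error

def amplification_visit():
    """Expansion that ignores the flaw: subquestions are chosen without
    regard to it, so the buried error is hit with probability (1/4)**6."""
    path = [random.randrange(BRANCHING) for _ in range(DEPTH)]
    return path == FLAWED_PATH

def debate_visit():
    """A strategic debater who suspects the flaw steers every recursion
    step toward it, so the error surfaces within a single debate."""
    path = list(FLAWED_PATH)  # the debater picks each subquestion deliberately
    return path == FLAWED_PATH

hits = sum(amplification_visit() for _ in range(100_000))
print(f"random expansion found the flaw {hits} times in 100k trees")
print(f"strategic debater found the flaw: {debate_visit()}")
```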
what is the advantage of using imitation+RL vs using supervised learning
Using supervised learning for imitation, over large action spaces, doesn’t seem like a good idea:
Exactly imitating an expert’s behavior is generally much harder than simply solving the task that the expert is solving.
If you don’t have enough capacity to exactly imitate, then it’s not clear why the approximation should maintain the desirable properties of the original process. For example, if I approximately imitate a trajectory that causes a robot to pick up a glass, there is no particular reason the approximation should successfully pick up the glass. But in the amplification setting (and even in realistic settings with human experts today) you are never going to have enough capacity to exactly imitate.
If you use an autoregressive model (or equivalently break down a large action into a sequence of binary choices), then the model needs to be able to answer questions like “What should the nth bit of my answer be, given the first n-1 bits?” Those questions might be harder than simply sampling an entire answer.
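As a sketch of that last point, here is what the supervised autoregressive imitation objective asks of the model at each step (PyTorch; `model` is a hypothetical module mapping a bit prefix to a logit for the next bit, introduced only for illustration):

```python
import torch
import torch.nn.functional as F

def autoregressive_imitation_loss(model, expert_bits):
    """Supervised imitation of one expert action, factored into bits.

    expert_bits: 0/1 tensor of shape (T,) encoding the expert's action.
    At each position the model must answer "what should this bit be,
    given all the bits before it?", which can be a harder question than
    simply producing a good action end to end.
    """
    loss = torch.tensor(0.0)
    for n in range(len(expert_bits)):
        prefix = expert_bits[:n]              # all bits before position n
        logit = model(prefix)                 # model's logit for bit n
        target = expert_bits[n].float()
        loss = loss + F.binary_cross_entropy_with_logits(logit, target)
    return loss / len(expert_bits)
```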
So to get around this, I think you either need a better approach to imitation learning (e.g. here is a proposal) or you need to add in RL.
I think the only reason we’d want to avoid imitation+RL is that informed oversight might be challenging, and that might make it too hard to construct an adequate reward function. You could hope to avoid that with a careful imitation learning objective (e.g. by replacing the GAN in the “mimicry and meeting halfway” post with an appropriately constructed bidirectional GAN).
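Purely as an illustration of what a bidirectional-GAN-style imitation objective could look like mechanically (this is my sketch, not a construction from the linked post; `G`, `E`, and `D` are hypothetical modules, and conditioning on the question is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def bigan_imitation_losses(G, E, D, expert_answer, z_prior):
    """BiGAN-style imitation losses (sketch, hypothetical modules).

    G: latent -> answer        (the imitator)
    E: answer -> latent        (encoder over expert answers)
    D: (answer, latent) -> logit, trained to tell expert pairs
       (expert_answer, E(expert_answer)) from generated pairs (G(z), z).
    """
    z = z_prior.sample()

    d_real = D(expert_answer, E(expert_answer))
    d_fake = D(G(z), z)

    # Discriminator: expert pairs -> 1, generated pairs -> 0.
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator and encoder jointly try to make the two joint distributions
    # indistinguishable (in practice the two losses are applied with separate
    # optimizers and appropriate detaching).
    ge_loss = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
               + F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real)))

    return d_loss, ge_loss
```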
I haven’t been thinking about non-RL approaches because it seems like we need to solve informed oversight anyway, as an input into any of these approaches to avoiding malign failure. So I don’t really see any upside from avoiding imitation+RL at the moment.