In the reverse direction, amplification mostly seems less adversarial, since it's pure supervised learning.
Note that you could do amplification with supervised learning, imitation, or RL as the distillation step. In the long run I imagine using imitation+RL, which brings it closer to debate.
Wei Dai asks:
Let me see if I understand this correctly. Suppose the task is to build a strong cryptosystem. One of the subtasks would be to try to break a candidate. With Amplification+SL, the overseer would have to know how to build a tree to do that which seems to imply he has to be an expert cryptanalyst, and even then we’d be limited to known cryptanalytic approaches, unless he knows how to use Amplification to invent new cryptanalytic ideas. With either Debate or Amplification+RL, on the other hand, the judge/overseer only has to be able to recognize a successful attack, which seems much easier. Does this match what you’re thinking?
I don’t see why building a tree to break a cryptosystem requires being an expert cryptanalyst.
Indeed, amplification with SL can just directly copy RL (with roughly the same computational complexity), by breaking down task X into the subtasks:
Solve task X.
Solve task X.
Generate a random candidate solution.
Evaluate each of those three proposals and take the best one.
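Unrolled, this breakdown is just best-of-n search expressed as a tree of subtasks. A toy sketch of that shape (the task and all names here are illustrative stand-ins, not part of the original discussion):

```python
import random

random.seed(0)

# Illustrative stand-in task: find a bitstring with as many 1s as possible.
TARGET_LEN = 16

def evaluate(candidate):
    """Score a proposal. In amplification, this evaluation step would itself
    be decomposed into subquestions rather than computed directly."""
    return sum(candidate)

def random_candidate():
    return [random.randint(0, 1) for _ in range(TARGET_LEN)]

def solve(depth):
    """The breakdown from the text: two recursive attempts at the task plus
    one random proposal, keeping whichever evaluates best."""
    if depth == 0:
        return random_candidate()
    proposals = [solve(depth - 1), solve(depth - 1), random_candidate()]
    return max(proposals, key=evaluate)

best = solve(depth=6)
print(evaluate(best))
```

The tree performs the same trial-and-error search that RL would, which is exactly why it inherits RL's dangers: the optimization pressure has simply been reintroduced inside the amplification tree.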
This breakdown is dangerous in the same way that RL is dangerous, and we’d like an alternative breakdown that doesn’t potentially introduce incorrigible/misaligned optimization. You might think *that* requires being an expert cryptanalyst, but again I don’t see the argument.
I do agree that there exist cases where “try stuff and see what works” is the only viable strategy, because there isn’t other internal structure that can be leveraged. In these cases it seems like we need to be able to use amplification to “sanitize” the stuff we try, which may act on the generating process or in the evaluation or in a more complicated way. That needs to be done whether we are doing RL directly or doing it inside of the amplification tree.
At this point I don’t understand how imitation+RL brings Amplification closer to Debate, or what is the advantage of using imitation+RL vs using supervised learning. I thought that perhaps it lets us train the AI using an overseer with less expertise (like in my example), but if that’s not the case, can you explain a bit more?
don’t understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is:
Your AI samples two answers to a question.
The human evaluates which one of them is better. The AI’s objective is to sample answers that are most likely to be marked as “better.”
In order to evaluate which answer is better, the human asks the AI subquestions.
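One training step of the setup above can be sketched as the following toy loop, where answers are just numbers, "better" means larger, and the model, judge, and update rule are all crude hypothetical stand-ins (not a real API):

```python
import random

random.seed(1)

class ToyModel:
    """Stand-in policy: samples answers around a mean, and shifts the mean
    toward whichever answers the judge marks as better."""
    def __init__(self):
        self.mean = 0.0
    def sample(self, question):
        return self.mean + random.gauss(0, 1)
    def reinforce(self, chosen):
        # Crude stand-in for an RL update toward the winning answer.
        self.mean += 0.1 * (chosen - self.mean)

def judge(question, a, b, ask):
    """The human comparison; it may consult the AI via subquestions
    (the `ask` callback) before deciding which answer is better."""
    ask("What criterion matters for this question?")  # subquestion
    return max(a, b)

def train_step(model, question):
    a, b = model.sample(question), model.sample(question)
    winner = judge(question, a, b, ask=model.sample)
    model.reinforce(winner)

model = ToyModel()
for _ in range(200):
    train_step(model, "toy question")
print(round(model.mean, 2))  # drifts upward as the policy learns
```

In the real setup the judge's subquestions are answered by the same policy being trained, and each subanswer is itself subject to the same pairwise evaluation.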
This is very similar to debate. The most salient difference is that in the case of amplification, the subanswers are recursively evaluated in the same way as the original answer (i.e. the AI is trying to optimize the probability that their answer would be picked as the better answer, if that subquestion had been chosen as the top-level question). In debate, we have two AIs competing, and each subanswer is generated in order to support one of the original answers / to produce a coherent narrative in combination with one of the original answers.
(There are a bunch of other incidental differences, e.g. is the process driven by the judge or by the debaters, but this doesn’t really matter given that you can ask questions like “What subquestion should I ask next?”)
The main advantage of debate, as I see it, is as a mechanism for choosing which subquestions to train on. That is, if there is an error buried somewhere deep in the amplification tree, it may never be visited by the amplification training process. But a strategic debater could potentially steer the tree towards that error, if they treat the entire debate as an RL process. (This was my main argument in favor of debates in 2015.)
what is the advantage of using imitation+RL vs using supervised learning
Using supervised learning for imitation, over large action spaces, doesn’t seem like a good idea:
Exactly imitating an expert’s behavior is generally much harder than simply solving the task that the expert is solving.
If you don’t have enough capacity to exactly imitate, then it’s not clear why the approximation should maintain the desirable properties of the original process. For example, if I approximately imitate a trajectory that causes a robot to pick up a glass, there is no particular reason the approximation should successfully pick up the glass. But in the amplification setting (and even in realistic settings with human experts today) you are never going to have enough capacity to exactly imitate.
If you use an autoregressive model (or equivalently break down a large action into a sequence of binary choices), then the model needs to be able to answer questions like "What should the nth bit of my answer be, given the first n-1 bits?" Those questions might be harder than simply sampling an entire answer.
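For concreteness, that autoregressive decomposition looks like the sketch below, where `next_bit_prob` is a hypothetical stand-in for the model answering "what should the next bit be, given the prefix?":

```python
import random

random.seed(2)

def sample_action(next_bit_prob, length):
    """Build one large action bit by bit; each step conditions on the
    prefix, i.e. answers "nth bit given the first n-1 bits"."""
    bits = []
    for _ in range(length):
        p = next_bit_prob(bits)  # conditional P(next bit = 1 | prefix)
        bits.append(1 if random.random() < p else 0)
    return bits

# Toy conditional: prefer alternating bits. Even this trivial rule shows
# the point: the per-bit question depends on the whole prefix.
def alternating(prefix):
    if not prefix:
        return 0.5
    return 0.99 if prefix[-1] == 0 else 0.01

action = sample_action(alternating, 12)
print(action)
```

The difficulty flagged above is that the conditional questions may be harder than sampling a whole answer: a model that can write a good proof end-to-end may still struggle to say what the correct next token is for an arbitrary partial proof.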
So to get around this, I think you either need a better approach to imitation learning (e.g. here is a proposal) or you need to add in RL.
I think the only reason we’d want to avoid imitation+RL is because informed oversight might be challenging, and that might make it too hard to construct an adequate reward function. You could hope to avoid that with a careful imitation learning objective (e.g. by replacing the GAN in the “mimicry and meeting halfway” post with an appropriately constructed bidirectional GAN).
I haven’t been thinking about non-RL approaches because it seems like we need to solve informed oversight anyway, as an input into any of these approaches to avoiding malign failure. So I don’t really see any upside from avoiding imitation+RL at the moment.