[Posting for Geoffrey Irving, who doesn’t have a LW account.]
It’s too early to really distinguish amplification vs. debate in practice. This is mentioned briefly in the paper, but the self-play structure of debate can make shorter arguments suffice (O(1)-depth debates vs. O(log n)-depth amplification trees), which is equivalent to saying that debate can work with high-branching-factor trees. The intuitive content there is that it may be hard for a human to ask a subquestion that reveals a flaw in an amplification tree. In the reverse direction, amplification mostly seems less adversarial since it’s pure supervised learning, but neither Paul nor I is happy to lean very hard on that feature of amplification. That is, if you have to lean on amplification being trained via supervised learning, the argument can’t be that amplification has no bad global minima.
I share your lack of satisfaction with what near-optimal debates look like; this is roughly Section 5.5. Lots of thought and experiment are required here. One note is that any understanding of what a near-optimal debate should look like may be translatable into a better judge (better instructions for the judge, for example), so such understanding would likely result in concrete actions rather than just more or less confidence. In thinking through what real debates would look like, I think it’s quite useful to focus on very short debates (2-4 steps, say) like the (overly simple) vacation example, but replace the counterarguments with something more realistic. Note that conditional on the agents giving the same true answer up front, the structure of the game after that point doesn’t matter, so the important questions are what happens in a neighborhood around that point and whether such ties are some sort of attractor (so that agents would want to say the same thing even if it’s slightly wrong).
Adversarial debate is used quite often in human settings. However, I think it’s sensible to be optimistic that we can design better debates than a lot of the human cases, which are typically bogged down by inadequate equilibria (the rules for legal debates are written by lawyers, the rules for academic debates are written by tenured professors, etc.).
I agree with the complexity analogy point about security. The “bias towards answers resolvable with short debates” is true, though you could also call it a “bias towards answers that we can be confident of”. It’s unclear which framing is more accurate at this point. The important question is whether that bias leads to misleading answers or just to less precise answers (including just “I don’t know”).
In the reverse direction amplification mostly seems less adversarial since it’s pure supervised learning
Note that you could do amplification with supervised learning, imitation, or RL as the distillation step. In the long run I imagine using imitation+RL, which brings it closer to debate.
Wei Dai asks:
Let me see if I understand this correctly. Suppose the task is to build a strong cryptosystem. One of the subtasks would be to try to break a candidate. With Amplification+SL, the overseer would have to know how to build a tree to do that, which seems to imply he has to be an expert cryptanalyst, and even then we’d be limited to known cryptanalytic approaches, unless he knows how to use Amplification to invent new cryptanalytic ideas. With either Debate or Amplification+RL, on the other hand, the judge/overseer only has to be able to recognize a successful attack, which seems much easier. Does this match what you’re thinking?
I don’t see why building a tree to break a cryptosystem requires being an expert cryptanalyst.
Indeed, amplification with SL can just directly copy RL (with roughly the same computational complexity) by breaking down task X into the following subtasks (a rough sketch in code follows the list):
Solve task X.
Solve task X.
Generate a random candidate solution.
Evaluate each of those three proposals and take the best one.
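To make that decomposition concrete, here is a minimal sketch, assuming hypothetical `model`, `evaluate`, and `random_candidate` callables standing in for the distilled agent, the amplified overseer’s evaluation, and a source of random proposals (none of these names come from the original discussion):

```python
def solve_by_decomposition(task_x, model, evaluate, random_candidate):
    """Illustrative sketch of the breakdown above: amplification with SL
    reproducing RL-style "generate and select" via subtasks.

    model(task)            -- the distilled agent's answer to a (sub)task
    evaluate(task, answer) -- the overseer's score for a candidate answer
    random_candidate(task) -- a randomly generated candidate solution
    All three interfaces are assumptions made only for this sketch.
    """
    proposals = [
        model(task_x),             # subtask 1: solve task X
        model(task_x),             # subtask 2: solve task X (a second sample)
        random_candidate(task_x),  # subtask 3: generate a random candidate
    ]
    # subtask 4: evaluate each of the three proposals and take the best one
    return max(proposals, key=lambda answer: evaluate(task_x, answer))
```

The final `max` over evaluator-scored proposals is where the RL-like selection pressure enters.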
This breakdown is dangerous in the same way that RL is dangerous, and we’d like an alternative breakdown that doesn’t potentially introduce incorrigible/misaligned optimization. You might think *that* requires being an expert cryptanalyst, but again I don’t see the argument.
I do agree that there exist cases where “try stuff and see what works” is the only viable strategy, because there isn’t other internal structure that can be leveraged. In these cases it seems like we need to be able to use amplification to “sanitize” the stuff we try, which may act on the generating process, on the evaluation, or in some more complicated way. That needs to be done whether we are doing RL directly or doing it inside of the amplification tree.
At this point I don’t understand how imitation+RL brings Amplification closer to Debate, or what is the advantage of using imitation+RL vs using supervised learning. I thought that perhaps it lets us train the AI using an overseer with less expertise (like in my example), but if that’s not the case, can you explain a bit more?
don’t understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is as follows (a rough sketch in code appears after the list):
Your AI samples two answers to a question.
The human evaluates which one of them is better. The AI’s objective is to sample answers that are most likely to be marked as “better.”
In order to evaluate which answer is better, the human asks the AI subquestions.
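As a rough sketch of that loop, assuming hypothetical `agent`, `human_judge`, and `rl_update` interfaces (illustrative names only, not an actual implementation):

```python
def amplification_rl_step(question, agent, human_judge, rl_update):
    """One training step of the setup described above (illustrative sketch).

    agent.sample(q)                    -- sample an answer to a (sub)question
    human_judge(question, a, b, ask)   -- the human picks the better answer,
                                          asking subquestions via `ask`
    rl_update(question, winner, loser) -- reinforce the preferred answer
    All of these interfaces are assumptions made only for this sketch.
    """
    # 1. The AI samples two answers to the question.
    answer_a = agent.sample(question)
    answer_b = agent.sample(question)

    # 3. To evaluate which answer is better, the human asks the AI subquestions.
    def ask(subquestion):
        return agent.sample(subquestion)

    # 2. The human evaluates which answer is better; the AI's training objective
    #    is to sample answers that are most likely to be marked as "better".
    preferred = human_judge(question, answer_a, answer_b, ask)
    winner, loser = (answer_a, answer_b) if preferred == answer_a else (answer_b, answer_a)
    rl_update(question, winner, loser)
    return winner
```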
This is very similar to debate. The most salient difference is that in the case of amplification, the subanswers are recursively evaluated in the same way as the original answer (i.e. the AI is trying to optimize the probability that its answer would be picked as the better answer, if that subquestion had been chosen as the top-level question). In debate, we have two AIs competing, and each subanswer is generated in order to support one of the original answers / to produce a coherent narrative in combination with one of the original answers.
(There are a bunch of other incidental differences, e.g. whether the process is driven by the judge or by the debaters, but this doesn’t really matter given that you can ask questions like “What subquestion should I ask next?”)
The main advantage of debate, as I see it, is as a mechanism for choosing which subquestions to train on. That is, if there is an error buried somewhere deep in the amplification tree, it may never be visited by the amplification training process. But a strategic debater could potentially steer the tree towards that error, if they treat the entire debate as an RL process. (This was my main argument in favor of debates in 2015.)
what is the advantage of using imitation+RL vs using supervised learning
Using supervised learning for imitation, over large action spaces, doesn’t seem like a good idea:
Exactly imitating an expert’s behavior is generally much harder than simply solving the task that the expert is solving.
If you don’t have enough capacity to exactly imitate, then it’s not clear why the approximation should maintain the desirable properties of the original process. For example, if I approximately imitate a trajectory that causes a robot to pick up a glass, there is no particular reason the approximation should successfully pick up the glass. But in the amplification setting (and even in realistic settings with human experts today) you are never going to have enough capacity to exactly imitate.
If you use an autoregressive model (or equivalently break down a large action into a sequence of binary choices), then the model needs to be able to answer questions like “What should the nth bit of my answer be, given the first n-1 bits?” Those questions might be harder than simply sampling an entire answer (see the sketch after this list).
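For concreteness, here is a minimal sketch of that autoregressive factorization, with a hypothetical `next_bit_model` standing in for the learned conditional distribution over the nth bit given the first n-1 bits (the name is an assumption for illustration):

```python
import random

def sample_action_autoregressively(next_bit_model, num_bits):
    """Sample a large action one binary choice at a time (illustrative sketch).

    next_bit_model(prefix) -- assumed to return P(next bit = 1 | prefix), i.e.
    an answer to "What should the nth bit of my answer be, given the first
    n-1 bits?" at each step.
    """
    bits = []
    for _ in range(num_bits):
        p_one = next_bit_model(tuple(bits))  # the conditional question at step n
        bits.append(1 if random.random() < p_one else 0)
    return bits
```

Each call to `next_bit_model` is one of those conditional questions, which may demand more understanding of the complete answer than producing one in a single shot.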
So to get around this, I think you either need a better approach to imitation learning (e.g. here is a proposal) or you need to add in RL.
I think the only reason we’d want to avoid imitation+RL is that informed oversight might be challenging, and that might make it too hard to construct an adequate reward function. You could hope to avoid that with a careful imitation learning objective (e.g. by replacing the GAN in the “mimicry and meeting halfway” post with an appropriately constructed bidirectional GAN).
I haven’t been thinking about non-RL approaches because it seems like we need to solve informed oversight anyway, as an input into any of these approaches to avoiding malign failure. So I don’t really see any upside from avoiding imitation+RL at the moment.