I like the basic idea, but I don't understand the details, so by default I won't include it in the newsletter. Some confusions:
Are the arguments the same thing as answers? (I get this impression because you say “Is arg_t a sufficient answer to Q in context S?”.) If not, where is the answer in the debate? More generally I would benefit a lot from a concrete example (e.g. the Bali vs. Alaska example).
Debate sets up a game and argues that the equilibrium is truth-telling. It does that by setting up a zero-sum game and then using self-play for training; self-play will converge to the Nash equilibrium, so you are then justified in only analyzing the equilibrium, while ignoring the training process. However, in your use of debate, afaict nothing enforces that you converge to the equilibrium of the zero-sum game, so I don’t see why you gain the benefits of debate.
Why do you want to add an auxiliary RL objective? I normally imagine two reasons. First, maybe the task you want to solve is well suited to RL, e.g. Atari games, and so you want to train on that RL objective in addition to the question answering objective, so that the RL objective lets you learn good representations quickly. Second, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn’t necessarily incentivize the right thing, whereas the RL objective does. (See Against Mimicry.) I think yours is aiming at the second and not the first?
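To make that second reason concrete, here is a toy numerical example (entirely made up, just to illustrate the Against Mimicry point): if M cannot represent the demonstrator exactly, the representable policy that best matches the demonstrator need not be the representable policy with the best outcomes.

```python
# Toy illustration only: imitation vs. RL objectives under imperfect imitation.
# The actions, rewards, and the two representable policies are all made up.
import math

rewards = {"A": 1.0, "B": 0.8, "C": 0.0}
demo_action = "A"  # the demonstrator always takes the best action

# Suppose M can only represent these two policies (standing in for limited capacity):
policies = {
    "pi_1": {"A": 0.6, "B": 0.0, "C": 0.4},  # puts most mass on the demo action
    "pi_2": {"A": 0.0, "B": 1.0, "C": 0.0},  # never matches the demo action
}

for name, pi in policies.items():
    imitation_loss = -math.log(pi[demo_action]) if pi[demo_action] > 0 else math.inf
    expected_reward = sum(prob * rewards[a] for a, prob in pi.items())
    print(f"{name}: imitation loss = {imitation_loss:.2f}, expected reward = {expected_reward:.2f}")

# pi_1 wins on the imitation objective (loss 0.51 vs. inf) but loses on expected
# reward (0.60 vs. 0.80), so the two objectives favor different policies here.
```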
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense. So you’re still using self-play to converge on the Nash in the situation where you anneal towards debate, and otherwise you’re using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.
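For concreteness, here is roughly the shape of objective I have in mind; this is a minimal sketch, and the function name, arguments, and the simple policy-gradient surrogate are illustrative rather than anything specified in the post.

```python
# Minimal sketch of mixing the supervised amplification loss with the zero-sum
# debate reward; every name here is illustrative rather than a real API.

def combined_loss(amplification_loss: float,
                  debate_logprob: float,
                  debate_reward: float,
                  alpha: float) -> float:
    """Combine supervised amplification with the self-play debate RL term.

    amplification_loss: supervised loss for imitating Amp(H, M)'s answers.
    debate_logprob:     log-probability the model assigned to its debate arguments.
    debate_reward:      +1 / -1 from the zero-sum debate game (win / lose).
    alpha:              mixing weight in [0, 1]; annealing alpha -> 1 recovers
                        the pure debate objective, alpha -> 0 pure amplification.
    """
    rl_loss = -debate_reward * debate_logprob  # simple policy-gradient surrogate
    return (1.0 - alpha) * amplification_loss + alpha * rl_loss

# e.g. combined_loss(amplification_loss=0.7, debate_logprob=-2.3, debate_reward=1.0, alpha=0.5)
```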
Are the arguments the same thing as answers?
The arguments should include what each debater thinks the answer to the question should be.
I think yours is aiming at the second and not the first?
Yep.
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.
But the answers are generated from pieces that involve humans, and those humans don’t behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through the function before they get their zero-sum reward… but then the equilibrium behavior could very well be different. For example, if you’re advising a human on how to play rock-paper-scissors, but they have a bias against paper and when you tell them to play paper they have a 33% chance of playing rock instead, you should now have a 50% chance of recommending paper, 33% chance of scissors, and 17% chance for rock. So I’m not sure that the reasons for optimism for debate transfer over into this setting where you have a human in the mix.
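(For concreteness, here is the arithmetic behind those numbers, under the toy assumption that the advised human plays paper with probability 2/3 and rock with probability 1/3 when told to play paper, and otherwise follows the advice exactly:)

```python
# Quick check of the rock-paper-scissors numbers above. We want the human's
# *actual* play to be the uniform Nash mix (1/3 each), so we solve for the
# recommendation probabilities (r, p, s) given the biased-human model.
p = (1 / 3) / (2 / 3)  # paper: (2/3) * p = 1/3          =>  p = 1/2  (50%)
s = 1 / 3              # scissors pass through unchanged  =>  s = 1/3  (33%)
r = 1 / 3 - p / 3      # rock: r + (1/3) * p = 1/3        =>  r = 1/6  (~17%)
print(round(r, 3), round(p, 3), round(s, 3))  # 0.167 0.5 0.333
```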
Maybe you could make an argument that for any H who we trust enough to do amplification / debate in the first place, this isn’t a problem, since Amp(H,M) is supposed to be more capable than M. Alternatively you could say that at the very least M is such that Amp(H,M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H,M) (as in the rock-paper-scissors example above).
Yep; that’s basically how I’m thinking about this. Since I mostly want this process to limit to amplification rather than debate, I’m not that worried about the debate equilibrium not being exactly the same, though in most cases I expect in the limit that Amp(H,M)≈M such that you can in fact recover the debate equilibrium if you anneal towards debate.