Are the arguments the same thing as answers?
The arguments should include what each debater thinks the answer to the question should be.
I think yours is aiming at the second and not the first?
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense. So you’re still using self-play to converge on the Nash equilibrium in the case where you anneal towards debate, and otherwise you’re using that self-play RL reward as one part of the loss and the supervised amplification loss as the other part.
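For concreteness, here is a minimal sketch of the annealed combination of the two training signals described above. This is my own illustration, not code from the original discussion; the names (combined_loss, anneal) and the placeholder loss values are assumptions.

```python
def combined_loss(debate_rl_loss: float, amplification_loss: float, anneal: float) -> float:
    """Convex combination of the two training signals (illustrative sketch).

    anneal in [0, 1]: anneal -> 1 keeps only the zero-sum debate RL term
    (the self-play reward derived from the judge's verdict), recovering pure
    debate; anneal -> 0 keeps only the supervised loss for imitating Amp(H, M).
    """
    assert 0.0 <= anneal <= 1.0
    return anneal * debate_rl_loss + (1.0 - anneal) * amplification_loss


# Illustrative schedule that anneals towards debate over training.
# The loss values are placeholder numbers, not measurements.
for step in range(0, 10001, 2500):
    anneal = min(1.0, step / 10000)
    print(step, round(combined_loss(0.8, 0.3, anneal), 3))
```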
Yep. But the answers are generated from pieces that involve humans, and those humans don’t behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through that function before they receive their zero-sum reward… but then the equilibrium behavior could very well be different. For example, if you’re advising a human on how to play rock-paper-scissors, but they have a bias against paper, so that when you tell them to play paper they have a 33% chance of playing rock instead, then you should now recommend paper with 50% probability, scissors with 33% probability, and rock with 17% probability. So I’m not sure that the reasons for optimism about debate transfer over to this setting where you have a human in the mix.
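As a quick check of those rock-paper-scissors numbers, here is a small sketch (mine, not part of the original exchange) that solves for the advisor's recommendation distribution; the compliance matrix B is an assumption that encodes the stated bias.

```python
import numpy as np

# Actions are ordered [rock, paper, scissors]. Column j of B is the human's
# play distribution when advised action j: advice to play paper is followed
# only 2/3 of the time, with rock played the remaining 1/3; other advice is
# followed exactly.
B = np.array([
    [1.0, 1/3, 0.0],   # probability the human plays rock
    [0.0, 2/3, 0.0],   # probability the human plays paper
    [0.0, 0.0, 1.0],   # probability the human plays scissors
])

# For the human's realized play to be the uniform Nash mixture (1/3 each),
# solve B @ recommend = uniform for the advisor's recommendation distribution.
recommend = np.linalg.solve(B, np.full(3, 1/3))
for action, p in zip(["rock", "paper", "scissors"], recommend):
    print(f"{action}: {p:.3f}")   # rock 0.167, paper 0.500, scissors 0.333
```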
Maybe you could make an argument that for any H who we trust enough to do amplification / debate in the first place, this isn’t a problem, since Amp(H,M) is supposed to be more capable than M. Alternatively you could say that at the very least M is such that Amp(H,M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H,M) (as in the rock-paper-scissors example above).
Yep; that’s basically how I’m thinking about this. Since I mostly want this process to limit to amplification rather than debate, I’m not that worried about the debate equilibrium not being exactly the same, though in most cases I expect that, in the limit, Amp(H,M)≈M, such that you can in fact recover the debate equilibrium if you anneal towards debate.