The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.
But the answers are generated from pieces that involve humans, and those humans don’t behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through that function before they get their zero-sum reward… but then the equilibrium behavior could very well be different. For example, suppose you're advising a human on how to play rock-paper-scissors, but they have a bias against paper: when you tell them to play paper, they have a 33% chance of playing rock instead. To make their actual play uniform, you should now recommend paper 50% of the time, scissors 33% of the time, and rock 17% of the time. So I'm not sure that the reasons for optimism about debate transfer over into this setting where you have a human in the mix.
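To make that arithmetic concrete, here's a minimal sketch of the computation (the `follow` matrix and the NumPy framing are mine, just encoding the hypothetical bias above): we solve for the recommendation distribution whose image under the human's bias is the uniform RPS equilibrium.

```python
import numpy as np

# Assumed setup (the hypothetical bias above, not anything from the original post):
# the human follows the recommendation exactly, except that when told "paper"
# they play "rock" instead 1/3 of the time.
# Column j = distribution over the human's actual move when move j is recommended.
# Order everywhere: rock, paper, scissors.
follow = np.array([
    [1.0, 1/3, 0.0],   # probability the human actually plays rock
    [0.0, 2/3, 0.0],   # probability the human actually plays paper
    [0.0, 0.0, 1.0],   # probability the human actually plays scissors
])

# We want the human's *actual* play to be the unbiased RPS equilibrium (uniform),
# so solve follow @ recommend = uniform for the recommendation distribution.
uniform = np.full(3, 1/3)
recommend = np.linalg.solve(follow, uniform)

print(dict(zip(["rock", "paper", "scissors"], recommend.round(3).tolist())))
# -> {'rock': 0.167, 'paper': 0.5, 'scissors': 0.333}
```

In other words, the advisor's equilibrium strategy is the pre-image of the unbiased equilibrium under the human's response function, which in general is not the unbiased equilibrium itself.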
Maybe you could make an argument that for any H whom we trust enough to do amplification / debate in the first place, this isn't a problem, since Amp(H,M) is supposed to be more capable than M. Alternatively, you could say that at the very least M is such that Amp(H,M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H,M) (as in the rock-paper-scissors example above).
Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same. In most cases, though, I expect that in the limit Amp(H,M) ≈ M, such that you can in fact recover the debate equilibrium if you anneal towards debate.