The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.
But the answers are generated from pieces that involve humans, and those humans don’t behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through that function before they get their zero-sum reward… but then the equilibrium behavior could very well be different. For example, suppose you're advising a human on how to play rock-paper-scissors, but they have a bias against paper: when you tell them to play paper, they have a 33% chance of playing rock instead. To make their actual play uniform, you should now recommend paper 50% of the time, scissors 33% of the time, and rock 17% of the time. So I'm not sure that the reasons for optimism about debate transfer over into this setting where you have a human in the mix.
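To make that arithmetic concrete, here's a minimal sketch of the computation (the `follow` matrix and the NumPy framing are mine, just encoding the hypothetical bias above): we solve for the recommendation distribution whose image under the human's bias is the uniform RPS equilibrium.

```python
import numpy as np

# Assumed setup (the hypothetical bias above, not anything from the original post):
# the human follows the recommendation exactly, except that when told "paper"
# they play "rock" instead 1/3 of the time.
# Column j = distribution over the human's actual move when move j is recommended.
# Order everywhere: rock, paper, scissors.
follow = np.array([
    [1.0, 1/3, 0.0],   # probability the human actually plays rock
    [0.0, 2/3, 0.0],   # probability the human actually plays paper
    [0.0, 0.0, 1.0],   # probability the human actually plays scissors
])

# We want the human's *actual* play to be the unbiased RPS equilibrium (uniform),
# so solve follow @ recommend = uniform for the recommendation distribution.
uniform = np.full(3, 1/3)
recommend = np.linalg.solve(follow, uniform)

print(dict(zip(["rock", "paper", "scissors"], recommend.round(3).tolist())))
# -> {'rock': 0.167, 'paper': 0.5, 'scissors': 0.333}
```

In other words, the advisor's equilibrium strategy is the pre-image of the unbiased equilibrium under the human's response function, which in general is not the unbiased equilibrium itself.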
Maybe you could make an argument that for any H whom we trust enough to do amplification / debate in the first place, this isn't a problem, since Amp(H,M) is supposed to be more capable than M. Alternatively, you could say that at the very least M is such that Amp(H,M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H,M) (as in the rock-paper-scissors example above).
Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same. In most cases, though, I expect that in the limit Amp(H,M) ≈ M, such that you can in fact recover the debate equilibrium if you anneal towards debate.