Very interesting! I’m pretty surprised by what “more persuasive” means here: a priori I would have expected it to involve taking gradients through the judge, à la GANs. The limitation to inference time means that this result generalizes much less far than I expected. The limitation of having no gradient does match the limitations of having a human as a judge, at least in short-term capability regimes.
Off the cuff, I wonder to what degree this could be used to implement something primarily for human use.
I have to disagree; BoN (best-of-N) is a really good approximation of what happens under RL fine-tuning (which is the natural training method for multi-turn debate).
I do worry that “persuasiveness” is the wrong word, but it seems like a reasonable interpretation when comparing debaters A and B. E.g., for a given question and set of answers, if A wins regardless of the answer assignment (i.e. no matter which answer it has to defend), then A is more persuasive than B.
“More persuasive” here means a higher win rate in debate, which I think is the same thing it would mean in any debate context? I agree the limitation to inference time rather than training is definitely important to keep in mind. I think that best-of-N using the judge as a preference model is a reasonable approximation of moderate amounts of RL training, but doing actual training would allow us to apply a lot more optimization pressure and get a wider spread of Elos. There has been some good debate RL work done in a similar setting here, and I’d love to see more research done with debate-trained models.
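For concreteness, here is a minimal sketch of the best-of-N-against-the-judge setup I have in mind. `ToyDebater`, `ToyJudge`, and their methods are hypothetical stand-ins rather than the paper's actual code; the point is just that the judge only ranks sampled arguments, and no gradient ever reaches the debater.

```python
import random

class ToyDebater:
    def generate(self, transcript, answer):
        # Stand-in for sampling an argument from the debater LLM.
        return f"argument-{random.randint(0, 999)} for {answer}"

class ToyJudge:
    def score(self, transcript, answer, argument):
        # Stand-in for the judge's probability that `answer` is correct
        # after seeing `argument`; real code would query the judge model.
        return random.random()

def best_of_n_turn(debater, judge, transcript, answer, n=8):
    """Sample N candidate arguments and keep the one the judge prefers.

    This is the inference-time optimization under discussion: the judge is
    used only as a preference model to rank samples, so no gradient flows
    into the debater.
    """
    candidates = [debater.generate(transcript, answer) for _ in range(n)]
    return max(candidates, key=lambda c: judge.score(transcript, answer, c))

# Example: one debate turn selected by best-of-8.
turn = best_of_n_turn(ToyDebater(), ToyJudge(), transcript=[], answer="A", n=8)
```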
Right, but it wasn’t actually optimized for persuasiveness via a gradient; the optimization is weak inference-time selection. I’m not saying the word is used wrongly, just that I was surprised the optimization wasn’t gradient-based.