“More persuasive” here means a higher win rate in debate, which I think is the same thing it would mean in any debate context? I agree the limitation to inference time rather than training is definitely important to keep in mind. I think that best-of-N using the judge as a preference model is a reasonable approximation of moderate amounts of RL training, but doing actual training would allow us to apply a lot more optimization pressure and get a wider spread of Elos. There has been some good debate RL work done in a similar setting here, and I’d love to see more research done with debate-trained models.
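For concreteness, here is a rough sketch of what I mean by best-of-N with the judge as a preference model. This is not the actual experimental code: `best_of_n`, `toy_generate`, and `toy_judge_score` are hypothetical names, and the toy generator and judge are stand-ins for a real debater model and a real judge model.

```python
import random
from typing import Callable, List


def best_of_n(
    generate: Callable[[random.Random], str],
    judge_score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Sample n candidate arguments and return the one the judge scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates: List[str] = [generate(rng) for _ in range(n)]
    # No gradient step anywhere: the only optimization pressure is picking the max.
    return max(candidates, key=judge_score)


# Hypothetical stand-in for a debater: emits an "argument" of random length.
def toy_generate(rng: random.Random) -> str:
    return "evidence " * rng.randint(1, 10)


# Hypothetical stand-in for a judge preference model: rewards more "evidence".
def toy_judge_score(argument: str) -> float:
    return float(argument.count("evidence"))


if __name__ == "__main__":
    for n in (1, 4, 16):
        best = best_of_n(toy_generate, toy_judge_score, n)
        print(n, toy_judge_score(best))
```

Raising N is the only knob here, which is why I'd expect the spread of Elos to be narrower than what actual training could produce.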
Right, but it wasn’t actually optimized for persuasiveness by gradient descent; the optimization is just weak inference-time selection. I’m not saying the word is used wrong, just that I was surprised no gradient was involved.