Hi, author here.

> Presumably we’re supposed to read between the lines and notice that this is a de-facto negative result.
FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.
There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:
Previous work from our group using a simpler variant of debate (only 1 or 2 turns) and a smaller expertise gap (the judge was allowed to skim the story within a time limit) didn’t find any improvements from “debate” (https://aclanthology.org/2022.lnls-1.3/; https://openreview.net/forum?id=9wLAwDrYDsQ). We discuss some hypotheses for why we ‘succeeded’ where they ‘failed’ in Sec. 7.
It is just really hard to collect high-quality data of this kind. Each human debate takes probably around 2 person-hours in total, and it’s a lot of work to train, supervise, and motivate annotators. So the dataset is inevitably going to be small.
The benefits of debate should grow with the size of the expertise gap, but our expertise gap is still quite small (consisting only of knowledge of a story that takes ~30 minutes to read). Keeping it this small is more or less logistically necessary for our setup, as we need to manufacture a fresh expertise gap every time someone judges a new question (otherwise information leakage between debates for the same story would compromise the judge’s blinding). Systematizing “expertise gaps” for a realistic-ish task turns out to be a tricky experimental design issue.
Especially given the small size of our expertise gap, consultancy is still a reasonably strong baseline when a persistent, skilled judge interrogates the consultant. Winning while dishonest in consultancy is easier than in debate, but far from trivial: judges can prolong the session as long as they need and interrogate the consultant about the evidence until they’re satisfied. The result is best understood in terms of an accuracy/efficiency tradeoff, which is why we gave the judge a turn penalty (a toy sketch of that tradeoff is below).
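To make that tradeoff concrete, here is a minimal toy sketch of turn-penalized judging. The `reward` and `turn_penalty` values are made up for illustration; this is not the paper’s actual scoring rule.

```python
# Toy illustration of an accuracy/efficiency tradeoff under a turn penalty.
# (Hypothetical values; not the scoring rule used in the paper.)

def judge_score(correct: bool, num_turns: int,
                reward: float = 1.0, turn_penalty: float = 0.02) -> float:
    """Score a judged question: reward for a correct verdict, minus a cost per turn."""
    return (reward if correct else 0.0) - turn_penalty * num_turns

# A persistent judge can often corner a dishonest consultant eventually,
# but every extra turn of interrogation costs something:
print(judge_score(correct=True, num_turns=3))    # quick and correct verdict, score ≈ 0.94
print(judge_score(correct=True, num_turns=12))   # slow but correct, score ≈ 0.76
print(judge_score(correct=False, num_turns=3))   # quick but wrong, score ≈ -0.06
```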
Some other relevant thoughts:
The human debate/consultancy difference was our primary endpoint, since we expected a priori that performance in debate and consultancy would come apart as debaters get stronger. We didn’t do multiple hypothesis correction. I wish we had more data for the main comparison (and for the others!), but it was a lot of time and effort to collect. And the error bars in the graph are 95% CIs, not standard deviations, so their size already reflects n (a toy illustration follows below).
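As an aside on reading those error bars, here is a small self-contained sketch (with hypothetical counts, not our actual sample sizes) of how the width of a 95% binomial confidence interval depends on n:

```python
# Wilson 95% confidence interval for a binomial proportion.
# (Illustrative only; the counts below are hypothetical, not the paper's.)
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# The same 84% point estimate gives very different error bars at different n:
print(wilson_ci(42, 50))     # n=50  -> roughly (0.71, 0.92)
print(wilson_ci(168, 200))   # n=200 -> roughly (0.78, 0.88)
```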
The gap from 84% to 100% accuracy on debates is largely explained by silly mistakes from judges and debaters. Making sure all the participants are skilled, calibrated, patient, and trying their hardest is difficult even when you’re sitting in the room together (which we weren’t always). And debating well is hard too: it takes a long time, it’s often not obvious what the optimal argument is, and it’s easy to miss the best quotes. There was a spectrum of judge skill (the top two had 100% accuracy in the 36 human debates they judged, though they fared comparatively worse in the other settings; n is small, of course). Some of the errors are also explained by data issues (questions that are arguable or contain mistakes). Some mistakes in consultancy are similar, but many of them seem harder to resolve with more careful training, because avoiding them really is a matter of outmaneuvering the consultant.
The quantitative results from the feedback surveys comparing human and AI performance were statistically significant, but honestly I would say it’s much more informative to just look at the transcripts: you can immediately tell a lot more about what’s wrong with the AI’s behavior, why the AI and consultancy settings are harder to judge than human debate, and what needs to improve for AIs to debate well and usefully. The AIs are obviously much, much worse at this task.
I think debate still needs to be tested with harder questions, stronger judges, and stronger debaters. Really pushing the limits and seeing more benefit from debate should get a lot easier once we can get models to debate well. But we also need better datasets. For future work we’re looking at different domains and bigger expertise gaps; see, for example, our new dataset GPQA: https://arxiv.org/abs/2311.12022
To briefly attend to assumptions: I am coming from a position of ‘debate optimism’ in the sense that I think debate-style supervision, done right, should be a strict improvement over RLHF, and I want to figure out how to make it work. I don’t think it’s a complete ‘solution’ for truthfulness but it seems to me like the best next step.