Hi, author here.
"Presumably we’re supposed to read between the lines and notice that this is a de-facto negative result."
FWIW, I genuinely see it as a positive result, and if I thought it should be read as a de facto negative result, I would make sure that was conveyed in the paper. I think the same is true of my coauthors.
There are reasons that we would expect a debate experiment like this to fail to detect an improvement, even if debate is a good paradigm for scalable oversight:
Previous work from our group, using a simpler variant of debate (1 or 2 turns only) and a smaller expertise gap (the judge was allowed to skim the story within a time limit), didn’t find any improvements from “debate” (https://aclanthology.org/2022.lnls-1.3/; https://openreview.net/forum?id=9wLAwDrYDsQ). We discuss some hypotheses for why we ‘succeeded’ where they ‘failed’ in Sec. 7.
It is just really hard to collect very good data of this kind. Each human debate takes probably around 2 person-hours total, and it’s a lot of work to train, supervise, and motivate annotators. So the data’s gonna be small.
The benefits of debate should grow with the size of the expertise gap, but our expertise gap is still quite small (consisting only of knowledge of a story that takes ~30 minutes to read). Keeping it this small is kind of logistically necessary for our setup, since we need to be able to manufacture a fresh expertise gap every time someone judges a new question (otherwise information leakage between debates for the same story would compromise the judge’s blinding). Systematizing “expertise gaps” for a realistic-ish task turns out to be a tricky experimental design issue.
Especially given the small size of our expertise gap, consultancy is still a reasonably strong baseline when a persistent and skilled judge interrogates the consultant. Winning while dishonest is easier in consultancy than in debate, but it’s far from trivial: judges can prolong the interaction as long as they need and interrogate you about the evidence until they’re satisfied. The result is best understood in terms of an accuracy/efficiency tradeoff, which is why we gave the judge a turn penalty.
Some other relevant thoughts:
The human debate/consultancy comparison was our primary endpoint, the one where we expected a clear difference, since we expected a priori that performance in debate and consultancy would come apart as debaters get stronger. We didn’t do multiple-hypothesis correction. I wish we had more data for the main comparison (and for the others!), but it was a lot of time and effort to collect. And the error bars in the graph are 95% CIs, not standard deviations, so their size reflects n.
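(For intuition on how the CI width depends on n, here’s a minimal sketch using a normal-approximation interval for a proportion; our actual intervals may be computed differently, and the numbers below are purely illustrative.)

```python
import math

def ci_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Illustrative only: the interval shrinks roughly like 1/sqrt(n).
for n in (20, 50, 100, 400):
    print(f"n={n:4d}  accuracy=0.84  +/- {ci_halfwidth(0.84, n):.3f}")
```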
The gap from 84% to 100% accuracy on debates is largely explainable by silly mistakes by judges and debaters. Making sure all the participants are good, calibrated, patient, and trying their hardest is difficult even when you’re sitting in the room together (which we weren’t always). And debating well is hard too; it takes a long time, it’s often not obvious what the optimal argument is, and it’s easy to miss the best quotes. There was a spectrum of judge skill (the top two had 100% accuracy in the 36 human debates they judged, though they fared comparatively worse in the other settings; n is small, of course). Some of the errors are also explained by data issues (questions that are arguable or contain mistakes). Some mistakes in consultancy are similar, but many of them seem harder to resolve with more careful training, because they really are a matter of outmaneuvering the consultant.
The quantitative results from the feedback surveys comparing human and AI performance were statistically significant, but honestly I would say it’s much more informative to just look at the transcripts; you can immediately tell a lot more about what’s wrong with the AI’s behavior, why the AI and consultancy settings are harder to judge than human debate, and what needs to improve for AIs to debate well and usefully. The AIs are obviously much, much worse at this task.
I think debate still needs to be tested with harder questions, stronger judges, and stronger debaters. Really pushing the limits and seeing more benefits from debate should hopefully be a lot easier once we can get models to debate well. But we also need better datasets. For future work we’re looking at different domains and bigger expertise gaps. See, for example, our new dataset GPQA: https://arxiv.org/abs/2311.12022
To briefly attend to assumptions: I am coming from a position of ‘debate optimism’ in the sense that I think debate-style supervision, done right, should be a strict improvement over RLHF, and I want to figure out how to make it work. I don’t think it’s a complete ‘solution’ for truthfulness but it seems to me like the best next step.
Thank you for posting this! I think it’s well worth thinking through debate baselines more.
TL;DR:
I think your open consultancy idea isn’t specific to consultancy. I think everything about it applies equally well to debate, and it basically corresponds to how the RL training should work in practice.
The 50⁄50 assumption in our consultancy and debate experiments was made for reasons having to do with our research question of interest, which is measuring the relative utility of the supervision paradigms. I don’t think we would want to directly use consultancy with a 50⁄50 prior to help us answer questions in practice, and that wasn’t the intent behind our use of it as a baseline (speaking at least for myself).
In more detail:
The way that debate training should work in practice is that the reward signal from winning the debate should feed back into the policy that decides the answer. So a model trained using debate should also choose the answer that it thinks will be most convincing to the judge. It’ll then debate that against the next most likely(-to-win) answer. You can show these predicted win probabilities to the judge, and it would be much the same as showing the top two answers (and their win probabilities) to the judge in consultancy, then arguing for the one assigned higher probability. This seems to me like the debate analog of open consultancy, and it’s what we should be comparing to, IMO.
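To make that concrete, here’s a minimal sketch of what I have in mind; `predict_win_prob` is a hypothetical stand-in for whatever win-probability estimate a debate-trained policy would provide, and the usage numbers are made up:

```python
from typing import Callable, Dict, List, Tuple

def open_debate_setup(
    question: str,
    candidate_answers: List[str],
    predict_win_prob: Callable[[str, str], float],  # hypothetical: P(winning a debate arguing for this answer)
) -> Tuple[str, str, Dict[str, float]]:
    """Pick the answer the policy thinks is most likely to win, plus the runner-up to debate against.

    The predicted win probabilities can also be shown to the judge, analogous to
    showing the top two answers (and their probabilities) in open consultancy.
    """
    probs = {a: predict_win_prob(question, a) for a in candidate_answers}
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[0], ranked[1], probs

# Toy usage with made-up probabilities:
toy = {"Answer A": 0.7, "Answer B": 0.3}
chosen, opponent, probs = open_debate_setup("toy question", list(toy), lambda q, a: toy[a])
print(chosen, "argues against", opponent, probs)
```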
It seems to me that doing chain-of-thought (CoT) before deciding the answer is also possible in debate. You can just sample from, or search/marginalize over, the CoT distribution to get a distribution over answers and then use that; the reward signal can feed back into this in the same way as well.
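And a sketch of that CoT variant, assuming a hypothetical `sample_cot_and_answer` that returns a reasoning trace ending in an answer; the resulting distribution could seed the same setup as above:

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

def answer_distribution_from_cot(
    question: str,
    candidate_answers: List[str],
    sample_cot_and_answer: Callable[[str], Tuple[str, str]],  # hypothetical: returns (cot, answer)
    num_samples: int = 32,
) -> Dict[str, float]:
    """Marginalize over sampled chains of thought to get a distribution over answers."""
    counts = Counter(sample_cot_and_answer(question)[1] for _ in range(num_samples))
    return {a: counts[a] / num_samples for a in candidate_answers}

# Toy usage with a stub sampler (made up):
dist = answer_distribution_from_cot(
    "toy question",
    ["Answer A", "Answer B"],
    lambda q: ("...reasoning...", random.choices(["Answer A", "Answer B"], weights=[2, 1])[0]),
)
print(dist)
```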
So why did we use the 50⁄50 assumption for consultancy?
The main reason is that we want the final judge accuracies to be comparable. Suppose you’re comparing debate (with no prior over answers) to an RLHF’d model which you are pretty sure gets 90% accuracy. Then you could just completely defer to the model and get 90% accuracy while not supervising it at all, whereas you’d have no way of doing this with debate. A fairer comparison would be against a debate-trained model which assigned probabilities to its answers; whether this gets higher accuracy is its own interesting question, and it’s a result we would hope for based on the findings that we already have.
Furthermore, if the prior accuracies differ, the comparison gets more complicated. Since we’re interested in the information gain from the supervision method, if the models started with different baseline accuracies we would need to correct for that when comparing the methods, and the results wouldn’t be as interpretable (we couldn’t directly compare the resulting judge accuracies). Plus it might be harder for human judges to stay calibrated.
We’re interested in measuring the useful information the supervisor can get from the supervision method. If we already start with a lot of information (e.g., if the option the model tells us is right is correct 90% of the time), it becomes much harder to tell the difference between supervision methods: the difference in information is going to be smaller, since you can always do at least as well as the baseline of deferring to the AI. So for the purpose of comparing supervision paradigms most easily, starting from a 50⁄50 baseline (minimal starting information) makes it easier to spot an effect, if it exists.
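(A toy illustration of that headroom point, with made-up numbers:)

```python
# Deferring to a model whose stated answer is right 90% of the time already gets you 90%,
# so a supervision method has at most 10 points of visible headroom; from a 50/50 prior
# there are 50 points to work with, so differences between methods are easier to detect.
for prior_accuracy in (0.5, 0.9):
    headroom = 1.0 - prior_accuracy
    print(f"defer-to-model baseline: {prior_accuracy:.0%}, headroom: {headroom:.0%}")
```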
Since we haven’t gotten debate training to work yet, there’s no clear way to assign probabilities to the answers, so we start with a prior of 50⁄50. Then we just need to do the same for consultancy.
So: overall, yes, I think open consultancy is a good idea for a baseline, but I think the proper comparison would be to ‘open’ debate, and it seems to me that the CoT aspect is orthogonal. It’s possible I’ve missed something in your proposal, though. Does this all sound reasonable to you?
Also, a side issue we might have to deal with in practice in the ‘open’ case is ensuring judge attentiveness and preventing premature deferral to the model once it becomes reasonably trustworthy. I think this is what you were getting at near the end of your post. In practice this might just mean trying to focus supervision on the error cases, which could basically function to bring the prior back towards 50⁄50 anyway.
I think this is probably where the ‘open’ variant comparison gets more interesting and realistic, because what we’re really interested in is ensuring the robustness of the resulting system. Even if debate works much better when starting from the 50% baseline, I can imagine a world where consultancy catches up on a lot of that difference over training, or where their error cases or answer distributions end up significantly different, and what it takes to improve performance on the margin might look different between the two approaches in ways we wouldn’t see just doing experiments with 50⁄50 priors on pre-selected answer pairs. I think we’re inevitably going to run into all of this once we start doing RL training and letting the model pick the answers, which is where I see this research agenda going.