Thank you for posting this! I think it’s well worth thinking through debate baselines more.
TL;DR:
I think your open consultancy idea isn’t specific to consultancy: everything about it applies equally well to debate, and it basically corresponds to how the RL training should work in practice.
The 50⁄50 assumption in our consultancy and debate experiments was made for reasons tied to our research question of interest, which is measuring the relative utility of the supervision paradigms. I don’t think we would want to directly use consultancy with a 50⁄50 prior to help us answer questions in practice, and that wasn’t the intent behind our use of it as a baseline (speaking at least for myself).
In more detail:
The way that debate training should work in practice is that the reward signal from winning the debate should feed back into the policy that decides the answer. So a model trained using debate should also choose the answer it thinks will be most convincing to the judge, then debate it against the next most likely(-to-win) answer. You can show these predicted win probabilities to the judge, and it would be much the same as showing the top two answers (and their win probabilities) to the judge in consultancy and then arguing for the one assigned higher probability. This seems to me like the debate analog of open consultancy, and IMO it is what open consultancy should be compared against.
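Concretely, the selection step I have in mind might look something like this (a minimal sketch; `win_prob` is a hypothetical stand-in for whatever the debate-trained policy predicts, and all names here are made up for illustration):

```python
def open_debate_setup(question, candidate_answers, win_prob):
    """Pick the policy's favored answer and its strongest rival.

    win_prob(question, answer) -> float: the policy's predicted
    probability that `answer` wins a debate on `question`.
    (Hypothetical interface, for illustration only.)
    """
    ranked = sorted(candidate_answers,
                    key=lambda a: win_prob(question, a),
                    reverse=True)
    favored, rival = ranked[0], ranked[1]
    # Show the judge the predicted win probabilities up front,
    # then run the debate between the top two answers.
    probs = {a: win_prob(question, a) for a in (favored, rival)}
    return favored, rival, probs
```

The point is just that the judge sees the same information as in open consultancy (top answers plus the model's confidence), with the debate run between the top two.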
It seems to me that doing CoT before deciding the answer is also possible in debate: you can sample from, or search/marginalize over, the CoT distribution to get a distribution over answers and then use that, and the reward signal can feed back into this in the same way.
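For instance, marginalizing over sampled CoTs to get an answer distribution could be as simple as the following sketch (`sample_cot_and_answer` is a hypothetical stand-in for a model call that returns the final answer reached by one sampled chain of thought):

```python
from collections import Counter

def answer_distribution(question, sample_cot_and_answer, n=32):
    """Estimate a distribution over answers by sampling n chains
    of thought and counting which final answer each lands on."""
    counts = Counter(sample_cot_and_answer(question) for _ in range(n))
    return {ans: c / n for ans, c in counts.items()}
```

That empirical distribution can then play the same role as the predicted win probabilities above.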
So why did we use the 50⁄50 assumption for consultancy?
The main reason is that we want the final judge accuracies to be comparable. Suppose you’re comparing debate (with no prior over answers) to an RLHF’d model which you are pretty sure gets 90% accuracy. Then you could completely defer to the model and get 90% accuracy while not supervising it at all, whereas you’d have no way of doing this with debate. A fairer comparison would be against a debate-trained model that assigns probabilities to its answers; whether this gets higher accuracy is its own interesting question, and a result we would hope for based on the findings we already have.
Furthermore, if the prior accuracies differ, the comparison gets more complicated. Since we’re interested in the information gain from the supervision method, models starting with different baseline accuracies would require a correction when comparing methods, and the results would be less interpretable (we couldn’t directly compare the resulting judge accuracies). Plus it might be harder for human judges to stay calibrated.
We’re interested in measuring the useful information the supervisor can get from the supervision method. If we already start with a lot of information (e.g., if the model is 90% accurate on the option it tells us is right), it becomes much harder to tell supervision methods apart: the remaining headroom in information is smaller, since you can always do at least as well as the baseline of deferring to the AI. So for the purpose of comparing supervision paradigms most easily, starting from a 50⁄50 baseline (minimal starting information) makes it easier to spot an effect, if one exists.
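To put rough numbers on this headroom argument (a quick check, assuming a binary answer pair): at a 50⁄50 prior there is a full bit of information left for the supervision method to supply, while at a 90%-accurate prior there is less than half a bit.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1-p):
    the maximum information a supervision method could still add."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# binary_entropy(0.5) is 1 bit; binary_entropy(0.9) is about 0.47 bits,
# so differences between methods have less room to show up.
```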
Since we haven’t gotten debate training to work yet, there’s no clear way to assign probabilities to the answers, so we start with a 50⁄50 prior; then we just need to do the same for consultancy.
So: overall, yes, I think open consultancy is a good idea for a baseline, but I think the proper comparison would be to ‘open’ debate, and the CoT aspect seems orthogonal to me. It’s possible I’ve missed something in your proposal, though. Does this all sound reasonable to you?
Also, a side question we might have to deal with in practice in the ‘open’ case is ensuring judge attentiveness and preventing premature deferral to the model once it becomes reasonably trustworthy. I think this is what you were getting at near the end of your post. In practice this might just mean focusing supervision on the error cases, which would basically function to bring the prior back towards 50⁄50 anyway.
I think this is probably where the ‘open’ variant comparison gets more interesting and realistic, because what we’re really interested in is the robustness of the resulting system. Even if debate works much better when starting from the 50⁄50 baseline, I can imagine a world where consultancy closes much of that gap over training, or where their error cases or answer distributions differ significantly; and what it takes to improve performance on the margin might look different between the two approaches in a way we wouldn’t see just running experiments with 50⁄50 priors on pre-selected answer pairs. I think we’re inevitably going to run into all of this once we start doing RL training and letting the model pick the answers, which is where I see this research agenda going.
This looks great! Let me see if I understand the big picture correctly (omitting some of your experiments to just get the main thrust):
You finetuned GPT-3.5 to output the correct answer in two settings: 1) directly on 500 MMLU questions, and 2) on debates between copies of GPT-4o on those same 500 MMLU questions. You found that validation accuracy when judging debates was better than answering blind.
This difference did not appear when using GPT-4o as the judge, suggesting that in the capability-asymmetric case the judge learned to rely on the debate, whereas in the capability-symmetric case it did not.
Testing this trained judge against different debaters, you find that the Elo of the debater models and the accuracy of the debate result track each other well. Strangely, though, Best-of-4 decoding on the debaters does not seem to increase Elo?
This shows an example of a case where judge training under capability asymmetry actually seems to produce the desired behavior in debates (i.e., the judge relies on the debate and can use it to generalize well). The main issue that comes to mind:
I worry about how much of what we’re seeing is just an effect of domain shift. Since you trained the model on GPT-4o debates, I would expect accuracy on those debates to be highest, and switching to GPT-4o mini and then GPT-3.5 should take us further out of domain, reducing the judging model’s accuracy. In that case the accuracy trend just reflects how OOD the debates are, and that happens to track with model skill for the debaters you tested. The fact that Elo also tracks in the expected way is harder to explain away, and makes it seem like the judge is learning something meaningful, but I’m pretty unsure about that.
I think I would see these results as a lot stronger if BoN had panned out and shown the expected Elo/accuracy relation, but it seems it did not.
What do you think of this? Anything I’m wrong about or missing here?
Also, a low-level question. You say, above the Elo/accuracy plot:
What does this mean? I would assume Elo needs to be computed by running a tournament between the models.