NYU Debate Training Update: Methods, Baselines, Preliminary Results
[This writeup reflects work done jointly with David Rein and Julian Michael at NYU’s Alignment Research Group]
Introduction
In the past year, there have been a number of projects aimed at validating the basic premises behind debate as a mechanism for scalable oversight (see here, here, and here). One important next step would be to actually train models to debate, as this would let us directly test how models adapt to a debate training objective and whether the debate protocol can withstand optimization pressure. For the last few months at NYU, we’ve been trying to do just that. Our hope is that by doing so with open-sourced models and code, we can help support scalable oversight research being conducted outside of the major labs.
In this write-up, we wanted to share our experimental setup, training procedure, and some of our preliminary results, with the hope of receiving feedback on our approach before we extend our work to more complicated domains.
TL;DR
Like previous work, we are using a hidden-information setup with questions from the QuALITY dataset. For now, our preliminary results are limited to single-turn, simultaneous debates, however we are currently working on extending this to the more interesting multi-turn setting.
Also like previous work, we test the robustness of the debate protocol by measuring the judge’s accuracy when evaluating rounds between debaters of varying skill levels. However, we also introduce two new baselines (both extensions of consultancy from Michael et al., 2023) to more precisely isolate the impact of the debate protocol itself.
We construct debaters of differing skill levels by varying the model type, supervised finetuning data, best-of-N value, and use of Direct Preference Optimization.
We use a finetuned version of GPT-4-Turbo as our judge, after observing that open-source models were too error-prone and that the standard GPT-4-Turbo struggled to produce calibrated results.
Our preliminary experiments show a clear positive relationship between debater skill and judge accuracy, providing some early signal that judge accuracy improves as the debaters improve, despite the judge staying fixed. However, we also find that this same effect can be replicated in our baseline settings that lack debate’s distinctive attributes.
Motivation and Background
Brief Motivation for Debate as a Scalable Oversight Technique
As originally proposed by Irving et al (2018), debate has the potential to empower less capable systems (e.g. humans) to verify the outputs of more-capable systems (e.g. future AIs).
In its most basic form, debate works by having two copies of a model, the debaters, argue against each other, where the arguments are in defense of two alternative responses to a common question. A judge, who can be either a human or a weaker, trusted model, then tries to discern which debater is arguing for the correct answer.
In principle, this debate setup should simplify the job of the non-expert human or AI that is performing the evaluation. For many difficult questions, deep domain knowledge or careful thinking may be needed to spot the counterarguments or subtle flaws necessary to directly validate an expert AI’s answer. Debate incentivizes competing models to discover these flaws and counterarguments and then clearly explain them to the judge. As a result, we’d expect debate to make it harder for an AI to convince a non-expert of a blatantly false claim. The hope is that this property scales alongside the capabilities of the models as the ability of the models to construct persuasive but incorrect arguments is matched by the rebuttal ability of their opponents.
Moreover, models trained to debate should also be more likely to output truthful or otherwise desirable responses. A model trained via RL to maximize its chance of winning should learn to generate the completions that are most heavily rewarded by the judge, which should, in theory, be speeches that defend the correct side.
Central Questions
Greenblatt & Shlegeris (2024) distinguish between two different approaches to AI Safety: alignment methods that train models to avoid undesirable actions and control methods that identify and prevent catastrophic outcomes even when the model is unaligned. Since, in theory, debate can empower humans to identify when a model’s outputs may be unsafe while also incentivizing models to generate safer outputs to begin with, debate may function as both a control and an alignment method.
Different research questions may be more or less relevant depending on whether one focuses on debate as an alignment or control approach. For example, a researcher interested in debate as an alignment approach may be interested in whether a debate-trained model chooses the right answer when given a choice as to which side to defend or whether the debater models maintain some internal representation of truthfulness. By contrast, a researcher interested in debate as a method of control may expect judges to have much higher levels of accuracy, even in the presence of deliberately deceptive strategies.
We are choosing to focus on three questions that we believe to be of interest to both camps:
Can judges use debate transcripts to answer questions with high degrees of accuracy?
Is debate still useful even when the debaters are more sophisticated than the judge?
If so, what features of debate explain why?
We are not the first to try and answer any of these questions. Parrish et al (2022) looked at short 1-2 round debates and found that they did not help human judges evaluate questions more accurately. By contrast, Michael et al (2023) ran longer debates and found that judges were in fact more accurate than in alternative setups without adversarial debaters. Most recently, Radhakrishnan (2023) and Khan et al (2024) tested these hypotheses with language models as both the debaters and the judges, with Khan et al (2024) finding that the LLM judge improved when reviewing transcripts between stronger debaters.
Experimental Setup
Preliminary Framework
Debate Protocol
To determine whether judges can use debate transcripts to accurately identify correct answers, we need to define a common debate protocol that we can use to run experiments.
The procedure used by Michael et al (2023) involves the two debaters being assigned a position and then presenting their opening statements without seeing each other’s arguments. Subsequent rounds of the debate follow a sequential format, with the second debater being able to respond to the previous debater’s claims. Between each round, the judge has an opportunity to ask questions of the debaters. Radhakrishnan (2023) and Khan et al (2024) used mostly similar structures, with the former limiting themselves to a single speech for each side while the latter continued to have the models generate speeches simultaneously even after the first round.
For now, like Radhakrishnan (2023), we are limiting ourselves to single-turn simultaneous debates, in which each debater only gives a single speech in favor of their assigned position. Follow-up work will cover the more general, multi-turn case.
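To make this concrete, the sketch below shows what a single-turn, simultaneous round looks like in our setting; the class and method names are illustrative placeholders rather than our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DebateRound:
    question: str
    answer_a: str   # position assigned to Debater_A
    answer_b: str   # position assigned to Debater_B
    story: str      # hidden passage, visible to the debaters but not the judge

def run_single_turn_debate(rnd: DebateRound, debater_a, debater_b, judge) -> float:
    """Each debater writes one speech for its assigned answer without seeing
    the other's speech; the judge then scores the pair of speeches without
    ever seeing the underlying story."""
    speech_a = debater_a.generate(rnd.question, rnd.answer_a, rnd.story)
    speech_b = debater_b.generate(rnd.question, rnd.answer_b, rnd.story)
    # Returns the judge's probability that Debater_A is defending the correct answer.
    return judge.evaluate(rnd.question, rnd.answer_a, rnd.answer_b, speech_a, speech_b)
```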
Types of Debatable Questions
To make these debates actually informative, we also need to find a set of questions that have a correct answer but that the judge cannot identify without additional assistance.
Parrish et al (2022) accomplished this task by using the QuALITY dataset from Pang et al (2022), which is a collection of short stories compiled with a set of associated multiple choice reading comprehension questions. To ensure that the judge needs the debate transcript to answer each question, Parrish et al (2022) limited the judge’s access to the short stories to only 90 seconds. Michael et al (2023) built on this setup by completely removing the judge’s ability to read the stories, an approach replicated by Radhakrishnan (2023) and Khan et al (2024). By withholding the text of the short story from the judge but not the debater, this setup artificially elevates the capabilities of the debaters relative to that of the judge: in this case, the debater’s expertise comes not from superior reasoning but from superior access to information.
Like Michael et al (2023), Radhakrishnan (2023), and Khan et al (2024), we will also be using a hidden-information setup on questions from the QuALITY dataset.
Evaluation Methods
Measuring Scalability
To determine how the ability of the judge to select correct answers scales alongside the abilities of the debaters, Khan et al (2024) propose running their experiment on a suite of models of differing capabilities. They measure each model’s skill level in a cross-play tournament, where each model is matched up against other models. They then measure the accuracy of the judge in a self-play tournament, where each model debates against a copy of itself. They expect that if debate really is a scalable method, then the accuracy of the judge should improve as the debaters grow stronger. To create models of varying skill levels, Khan et al (2024) varied the base model being used, including Claude 1, Claude 2, GPT-3.5, and GPT-4, in addition to employing different methods to improve speech quality, specifically Best-of-N selection and critique-and-refinement. Meanwhile Radhakrishnan (2023) opted to generate skill diversity by training a version of Claude 2 using RL and evaluating at different checkpoints, in addition to using models with different Best-of-N values.
We combine these approaches and experiment on models with different base model types, Best-of-N values, supervised finetuning data, and RL training.
Baseline Comparison
Introduced by Michael et al (2023), the “consultancy” baseline refers to having a single debater introduce arguments in favor of their assigned side with no opposing debater present, with the consultant being assigned the correct side half of the time. In their study of humans, Michael et al (2023) found that consultancy resulted in an error rate that was double that of debate and, in their study of API-based models, Khan et al (2024) found a negative relationship between consultant strength and judge accuracy, which is the opposite of the expected result for debate. This is intuitive, because competent dishonest consultants (i.e. consultants arguing for incorrect answers) have many degrees of freedom to make it difficult for the judge to detect their dishonesty.
We use the following two variants of this approach for our baselines, which we are calling ensembled and double consultancy. To avoid overloading terminology, we are going to use the term single consultancy to refer to the version of consultancy used by Michael et al (2023) and Khan et al (2024).
Ensembled Consultancy: In this version, we average together the results of two single consultancies where the consultant is arguing for different sides of the same question. As long as the judge’s scores are not negatively correlated with the true answers, this should be a strictly stronger baseline than normal, single consultancy (see the scoring sketch below).
Double Consultancy: In this setup, we also train models to become competent consultants and have them generate transcripts as if no other debater is present. However, instead of the judge seeing the two arguments separately, as in ensembled consultancy, we show the judge both arguments, allowing the judge to explicitly compare the arguments. Note that this is the same evaluation method used for debate. The difference is in the training objectives: debaters are trained to optimize their win probability when the judge sees both arguments, whereas consultants are trained to optimize their win probability when the judge only sees a single argument at a time. If the theory behind debate holds, then we would expect that this second, non-adversarial training objective will not incentivize refutation or counterargument as effectively as debate.
Using both ensembled and double consultancy allows us to isolate the effects of the different mechanisms present in debate.
Ensembled Consultancy – Double Consultancy Gap: Previous work on questions without an expertise gap, such as Zheng et al (2023) and Liusie et al (2024), suggests that being able to see two responses at the same time can improve accuracy above and beyond what is achievable when only seeing a single completion. The difference in performance between these two baselines captures the value that this pairwise comparison contributes to the success of the debate protocol. Since the models in both setups are trained and prompted in the same manner, any difference in outcomes should be attributable to the fact that the judge can see both arguments at the same time when making their judgement.
Double Consultancy – Debate Gap: The difference in performance between double consultancy and debate captures the importance of debater interaction. In both setups, the judge has access to speeches defending each possible answer to the question. The only difference is whether or not the debaters can engage with one another. Although the debaters will never directly refute each other in a one-round debate, this interactive effect is still implicitly captured in their reward signal, which rewards the model if it successfully convinces the judge that it is more likely to be defending the correct side than its opponent. In principle, this could encourage debaters to successfully pre-empt or predict opponent arguments. By contrast, the consultancy training mechanism rewards speeches that are deemed correct when viewed independently of any opposing speech.
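To make the two baselines concrete, here is the minimal scoring sketch referenced above; the judge interfaces and the way confidences are combined are assumptions for illustration, not our exact implementation.

```python
def ensembled_consultancy_score(judge, question, answer_a, answer_b,
                                speech_for_a, speech_for_b) -> float:
    """Judge each consultancy in isolation, then average the two independent
    estimates of P(answer_a is correct)."""
    p_a = judge.evaluate_single(question, answer_a, speech_for_a)  # P(answer_a correct | speech for A alone)
    p_b = judge.evaluate_single(question, answer_b, speech_for_b)  # P(answer_b correct | speech for B alone)
    return 0.5 * (p_a + (1.0 - p_b))

def double_consultancy_score(judge, question, answer_a, answer_b,
                             speech_for_a, speech_for_b) -> float:
    """Show the judge both consultant speeches side by side, exactly as in a
    debate transcript; only the speakers' training objective differs from debate."""
    return judge.evaluate_pair(question, answer_a, answer_b,
                               speech_for_a, speech_for_b)
```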
Behavioral Metrics
Although these baselines can capture the relative effect of debater interaction, we are still limited in our ability to draw inferences about how the effect of debater interaction will scale to even more complex models.
The theory behind debate, as articulated in Irving et al (2018) and Michael et al (2023), makes specific predictions about the kinds of behavior that we hope to see exhibited by debate-trained models. For example, a debate-trained model should be unwilling to advance arguments that they anticipate will be defeated by an opponent argument or will be exposed as a lie. By contrast, a consultant-trained model should be more willing to advance these seemingly-plausible falsehoods. Similarly, we’d expect that a highly rated model should differ from its lower rated peers along multiple different axes of truthful behavior, including use of evidence, logical coherence, and claim falsifiability. If we see these improvements, then we have additional reason to believe that the debate protocol is rewarding truthful behavior and punishing untruthful behavior. Without a plausible account of the precise behavioral differences between the different trained models, it is difficult to draw conclusions about whether the debate mechanism is working as anticipated.
We use four different methods to detect behavioral differences between our various models.
Evidence Counting Metrics: To ground arguments in objective fact, the debaters and consultants are encouraged to quote from the hidden passage, with the exact quote being verified by an external tool. We track the length, frequency, and accuracy of these quotes. A quote is considered accurate if there is a corresponding section in the hidden passage that perfectly or nearly perfectly matches the quoted section (i.e. identical except for punctuation and conjugation); a matching sketch appears after this list of metrics.
Evidence LLM Classifier: To track the strength of the selected evidence, we extract the quotes from each speech and present them to GPT-4T, asking it to determine which position is best supported by the quotes. Models with a strong grasp of evidence should be more likely to have their quotes classified correctly, i.e., it should be obvious which answer their quote supports.
LLM Distribution Detection: We leverage Zhong et al (2022)’s mechanism for detecting differences in the distribution between two text corpora. At a high level, this involves prompting GPT-4T to identify differences between speeches on the same question from two different models, and then validating that such differences are meaningful by measuring GPT-4T’s ability to use that difference to correctly classify anonymized speech in the dataset.
Qualitative Assessment: Of course, these metrics and classifiers cannot detect every change in behavior. To supplement our analysis, we also read through all of the transcripts to identify any differences our automated analysis might have missed. This is especially useful for detecting certain high-level strategic behaviors, such as whether the incorrect debater is more likely to spout easily-checked falsehoods when trained via a non-debate training objective.
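As referenced above, a minimal sketch of the quote-accuracy check might look like the following; the normalization rules and fuzzy-matching threshold are assumptions meant to approximate the “perfect or nearly perfect match” criterion rather than the exact external tool we use.

```python
import re
import difflib

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before matching."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def quote_is_accurate(quote: str, passage: str, threshold: float = 0.95) -> bool:
    """A quote counts as accurate if a near-identical span exists in the hidden
    passage: an exact match after normalization, or a high fuzzy-match ratio to
    tolerate small differences such as conjugation."""
    q, p = normalize(quote), normalize(passage)
    if q in p:
        return True
    window = len(q)
    step = max(1, window // 4)
    for start in range(0, max(1, len(p) - window + 1), step):
        if difflib.SequenceMatcher(None, q, p[start:start + window]).ratio() >= threshold:
            return True
    return False
```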
Training Methods
Debaters
The main goal of scalable oversight is to develop methods that can oversee AI systems even as they grow increasingly capable. To assess this, supervision performance (i.e. blinded judge accuracy, in our setting) should increase as the expert improves. We vary models along four different axes to generate a spread of model capabilities along which we can plot judge accuracy.
Model Type
We use two different open-source models as our base models. The first is a variant of instruction-finetuned Llama2-13B (Touvron 2023, Yukang 2023) that has been further finetuned to support a 32k context window. The second is Mixtral 8x7b. All models are 4-bit quantized to fit on a single A100.
Supervised Finetuning Data Selection
The second way to generate skill differentiation is by varying the type of data we used to fine-tune the models.
Out of the box, many open-source models lack the basic capabilities needed to become a competent debater. In our experiments, both Llama 2 and Mixtral-8x7b were excessively verbose and would often exhaust their token limit before reaching their core arguments. When they did advance arguments, those arguments were vague or generic. Based on their speech patterns, it seemed as if prompting the model to engage in a debate switched the models into a persona mimicking that of a high school class debater, complete with all the verbal tics (e.g. beginning with “Good morning, judge”) and lack of sophistication that this implies. Other non-state-of-the-art, closed-source systems have similar issues. Radhakrishnan (2023) reports that Claude 2-based debaters exhibited excessive congeniality towards the opposing side, while the transcripts from Khan et al (2024) reveal that GPT-3.5 struggles with the basic mechanics of citing evidence. We observed both of these problems with Llama 2 and Mixtral-8x7b as well.
As a result, we begin by fine-tuning our models with transcripts of human debaters collected by Michael et al (2023) at NYU. Due to hardware limitations, we opted for low-rank adaptation rather than a full fine-tune. Even though there were only 102 distinct transcripts in the dataset, we saw significant improvement in brevity, combativeness, and use of evidence from this finetuning.
We also supplement these human transcripts with those of GPT-4T debaters collected by Khan et al (2024), yielding a total of 969 transcripts. This has the benefit of reducing the model’s tendency to use slang, a behavior it appears to pick up from the human transcripts.
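For concreteness, here is a hedged sketch of how a 4-bit quantized base model with a LoRA adapter can be set up for this supervised finetuning step; the checkpoint name and hyperparameters are illustrative placeholders, not our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; our actual base models are an instruction-tuned,
# long-context Llama2-13B variant and Mixtral-8x7b.
BASE_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit quantization so the model fits on a single A100.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Low-rank adaptation instead of a full finetune (illustrative hyperparameters).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The wrapped model can then be trained on the ~969 debate transcripts with a
# standard causal language modeling objective.
```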
Best-of-N
The third way we generate skill differentiation is by selecting speeches using the Best-of-N method. In this setup, instead of generating a single speech, the model generates N speeches and then uses some selection mechanism to choose the best of those N speeches.
Like Radhakrishnan (2023) and Khan et al (2024), we use the real judge (a prompted version of GPT-4T) to select between speeches. However, unlike both of these works, we do not present each speech to the judge in isolation or next to a dummy speech. Instead, we have the model generate M simulated speeches for their opponent’s side. We then pair each of the N speeches with the M opponent speeches and present the pairs to the judge as if it were a debate transcript. Note that these opponent speeches are not the actual speeches used by the opponent in the round: these are speeches generated on behalf of the opponent’s side but they are not used in the actual round’s transcript. The main advantages and disadvantages of this approach are discussed in more detail in the appendix (to avoid cluttering this post, the appendix can be found here).
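A minimal sketch of this selection procedure is below; the generation and judging interfaces are placeholders, and the aggregation (averaging the judge’s score over the M simulated opponents) is one reasonable choice rather than a description of our exact implementation.

```python
def best_of_n_with_simulated_opponents(debater, judge, question,
                                       own_answer, opp_answer, story,
                                       n: int = 4, m: int = 2) -> str:
    """Generate N candidate speeches and M simulated opponent speeches, judge
    every (candidate, simulated opponent) pairing as if it were a debate
    transcript, and keep the candidate with the highest average judge score.
    The simulated opponent speeches never appear in the real round."""
    candidates = [debater.generate(question, own_answer, story) for _ in range(n)]
    simulated_opponents = [debater.generate(question, opp_answer, story) for _ in range(m)]

    def average_score(speech: str) -> float:
        scores = [judge.evaluate_pair(question, own_answer, opp_answer, speech, opp)
                  for opp in simulated_opponents]
        return sum(scores) / len(scores)

    return max(candidates, key=average_score)
```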
Direct Preference Optimization
Our final method for generating stronger models is to use direct preference optimization (Rafailov et al, 2023). DPO is a method that takes in pairs of completions to the same prompt and optimizes the model to increase the probability of generating the better of the pair relative to the probability of generating the worse of the pair, while being constrained to stay close to its initial distribution. It has been widely used as a method for achieving many of the same benefits as RLHF. Full details of our DPO training are in the appendix.
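For reference, the DPO objective from Rafailov et al (2023), where $y_w$ and $y_l$ are the preferred and dispreferred speeches for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen starting policy, and $\beta$ controls how far the trained policy $\pi_\theta$ can drift from it:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$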
Results
In total, we have 4 different models, each with 4 different Best-of-N settings, for a total of 16 models with different skill levels.
“Llama-Human” — This is a 4-bit quantized Llama2-13b model finetuned using data only from the transcripts of human debaters collected by Michael et al (2023).
“Mixtral-Human” — This is a 4-bit quantized Mixtral-8x7b model, also finetuned using data only from the transcripts of human debaters. This differs from “Llama-Human” in terms of its base model choice.
“Mixtral-Combined” — This is a 4-bit quantized Mixtral-8x7b model, finetuned using data from both the human debaters and the GPT debaters created by Khan et al (2024). This differs from “Mixtral-Human” in terms of the data used for supervised finetuning.
“Mixtral-DPO” — This is the Mixtral-Combined model with a layer of DPO training added on top.
From here on out, any numerical suffix appended to a model name corresponds to the Best-of-N setting. In all cases, the number of simulated opponent speeches is 2.
Performance is measured based on how often each model won in a round-robin cross-play tournament. As expected, switching from Llama2-13b to the larger Mixtral-8x7b led to higher win rates, as did increasing the amount of data used in supervised fine-tuning and applying DPO. Increasing the Best-of-N value also led to higher scores for all values of N tested. Since some answers are easier to defend than others, we had each pair of models debate each question twice, flipping the sides between rounds. The winner of the round was the model with the higher average judge confidence across both rounds. After this side-bias adjustment, the skill gap between the different models only grew wider.
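To illustrate the side-bias adjustment, here is a small sketch of how one matchup on one question might be scored; the data layout is an assumption for illustration.

```python
def matchup_winner(conf_model1_first: float, conf_model1_second: float) -> str:
    """conf_model1_first:  judge's confidence in model 1's answer in the first round.
    conf_model1_second: judge's confidence in model 1's answer after the sides are flipped.
    Averaging across both rounds cancels out any systematic preference the judge
    has for one side or one answer."""
    avg_model1 = 0.5 * (conf_model1_first + conf_model1_second)
    return "model_1" if avg_model1 > 0.5 else "model_2"
```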
Judging
Model Strength
Selecting an appropriate judge model is essential for any debate setup to work. In an ideal experimental setup, the judge model should be weaker than the debater models, so as to better approximate a future setting where the debater models are untrusted superhuman AIs and humans are the judges. However, there are a number of barriers to getting this setup to work in practice. In their debate experiments, Khan et al (2024) found that the accuracy of a judge using Llama2-70b was indistinguishable from random chance, with GPT-3.5’s accuracy only slightly better. They found that using a GPT-4T-based judge yielded the strongest results, with accuracies of up to 80% on certain reading comprehension tests. Our more informal tests involving GPT-3.5, Llama2-13B, and Mixtral-8x7b with a trained linear probe returned similar results. We also observed no benefit when allowing the judge to use chain-of-thought reasoning.
It is still an open question whether the failure of weaker models to accurately judge rounds is a problem for debate as a mechanism for scalable oversight. Certainly, it is less than ideal that we have yet to observe any evidence that a weaker judge model can oversee a stronger debater model using debate. However, there is some suggestive evidence that this performance gap comes from the judge’s lack of absolute capabilities rather than the relative capability gap between the judge and the debaters. For example, Levy et al (2024) finds that the gap in performance between GPT-4T and Mixtral-8x7b on certain QA tasks more than doubled when the context length was extended to 1000 tokens, which is roughly the length of our debate transcripts, suggesting that non-state-of-the-art models may be weaker at the kind of long-context reasoning needed to judge.
Weaknesses of Out-of-the-Box Models
Although GPT-4T was the best performing judge model that we tried, it nonetheless still has a couple of weaknesses that impede its ability to act as an ideal judge in an experiment.
It’s uncalibrated. We are interested not only in choosing a winner, but also in assigning a probability that each position is the correct answer. We tried two ways of generating these probabilities. First, we looked at the log probabilities of the output tokens associated with each of the debaters’ names. Unfortunately, as discussed earlier, the token probabilities were clustered near 0 and 1 and were not calibrated. Second, we prompted the model to output the probabilities as tokens (e.g. it would output “Winner: Debater_A | 70%” instead of “Winner: Debater_A”). However, we found that these stated probabilities were only loosely correlated with the actual win probabilities.
It’s sycophantic. As documented by a variety of existing works (e.g. Perez 2022), many instruction-tuned models default to agreeing with their user. Although this is not necessarily a problem for debate, where the user presents a transcript containing two different positions, this poses challenges for the construction of a non-adversarial baseline like consultancy. In their experiments, Khan et al (2024) found that the GPT-4T judge agreed with their strongest debaters 80-92% of the time in their version of the consultancy baseline.
It’s exploitable. Even setting aside jailbreaks, previous work by Wan et al (2024) and others suggests that language model judges are highly sensitive to certain linguistic features that have only limited argumentative value. For example, they found that language models were more likely to cite a piece of evidence if it used the exact phrasing found in the question than if it conveyed semantically equivalent content that also cited relevant scientific sources. Following their example, we experimented with appending a prefix and suffix to the speeches delivered by the debater arguing on behalf of the incorrect position. The most successful addendums simply restated the question along with their position (e.g. if the question was “How many children did the man have?” and the incorrect debater’s answer was “6”, we would begin and end its speeches with “The correct answer to the question ‘How many children did the man have’ is ‘6’”). A similar exploit involved appending a suffix to mimic the judge’s output format (e.g. the debater would end its speech with “Winner: Debater_A”). Both of these exploits increased the judge’s error rate by 10 percentage points.
Trained Judge
To address these issues, we finetuned GPT-4T to act as the judge. Our training procedure was as follows:
Inputs: We combined the transcripts from Michael et al (2023)’s human debates with a small subset of Khan et al (2024)’s GPT debates such that there was an equal number of transcripts from each. To ensure that our baseline comparisons were fair, we had half our samples be in the format of our consultancy baseline. This resulted in a single, unified judge capable of judging both debates and our consultancy baseline.
Labels: For the human debates, we had the model output the winner as determined by the human judge along with the explicit probability that the human judge assigned to the winning side after the first round (e.g. “Winner: Debater_A | 72%”). Importantly, this differs from Radhakrishnan (2023), who used gold labels for their judge training. We also experimented with using gold labels, but found that using the human labels led to more calibrated results, likely because the gold labels do not include confidences. For the GPT debates, Khan et al (2024) did not collect human judgments after one round, so we finetuned GPT-4T to output their human judgements after three rounds of debate (but GPT-4T is still only shown the first round).
Training: The model was finetuned to predict the output labels using the OpenAI API for two epochs.
Extracting judgements at inference: Counterintuitively, we found that using the explicit probability output by the judge did not result in the most calibrated results. Instead, looking at the log probabilities of each winner’s name yielded a nearly perfect calibration curve. This was surprising: the uncertainty expressed in the winning probabilities appears to have backpropagated into uncertainty over the name of the winner.
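Putting the label format and the inference-time extraction together, here is a hedged sketch; the prompts and helper names are illustrative, though the message-based JSONL layout follows OpenAI’s standard chat finetuning format.

```python
import json
import math

# One finetuning example: the judge reads the first-round transcript and is
# trained to emit the winner plus the human judge's confidence.
example = {
    "messages": [
        {"role": "system", "content": "You are judging a reading comprehension debate."},
        {"role": "user", "content": "<question, answer options, and first-round transcript>"},
        {"role": "assistant", "content": "Winner: Debater_A | 72%"},
    ]
}
with open("judge_finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")

def win_probability(logprob_a: float, logprob_b: float) -> float:
    """At inference, ignore the stated percentage and instead renormalize the log
    probabilities of the two debater names at the position after 'Winner:';
    this yielded a nearly perfectly calibrated probability in our experiments."""
    p_a, p_b = math.exp(logprob_a), math.exp(logprob_b)
    return p_a / (p_a + p_b)
```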
Results
As a result of training our judge, its overall accuracy on self-play debates collected from all model types and BoN values increased from 69% to 74%. Similarly, the judge’s calibration also improved, with its Brier score falling from 0.29, which is worse than guessing 0.5 every time (which would give a Brier score of 0.25), to 0.17. We also found that the finetuned model was no longer vulnerable to the two suffix attacks that previously impacted the non-finetuned model, with the performance gap dropping from 10 percentage points to 0. Finally, in the non-adversarial setting, the finetuned judge no longer expressed a preference for the consultant’s position, implying that the finetuning removed the sycophancy bias for this setting.
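For reference, the Brier score over $N$ judgments, where $p_i$ is the judge’s stated probability that a given debater is correct and $y_i \in \{0,1\}$ records whether that debater actually was; always guessing $p_i = 0.5$ gives exactly $0.25$, which is why a score of 0.29 is worse than chance-level guessing:

$$\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - y_i\right)^2$$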
Evaluation
Once the debaters and judge are trained, we can start using these models to validate whether the basic theory behind debate as a scalable oversight mechanism actually holds in practice.
Debater Skill—Judge Accuracy Relationship
The first question we ask is about how judge accuracy changes depending on the debaters’ skill level. If the theory behind debate holds, then there should be a positive relationship between the strength of the debaters and the accuracy of the judge.
Similar to Khan et al (2024), we do indeed observe a statistically significant but noisy positive relationship between debater strength and judge accuracy. The different points for each particular model correspond to varying N in BoN.
Comparison to Consultancy Baselines
However, to isolate the unique effect of debate, we also want to validate whether this same effect can be observed without any of the unique attributes of debate.
Overall, when we aggregate across all the different Best-of-N values, debate is more accurate than ensembled consultancy, which is the setting where the judge evaluates the speeches defending each side of the question in isolation before averaging the result. However, double consultancy, in which the judge gets to see the speeches defending each side of the question side-by-side, yields an equivalent overall accuracy to debate.
The difference in accuracy between debate and ensembled consultancy is significant at the p<0.01 level; the difference between debate and double consultancy is insignificant (p=0.38). When broken out by model type, the differences are mostly insignificant across almost all comparisons. The error bars on the graph represent the 90 percent confidence interval.
Similarly, we notice that the relationship between debater strength and judge accuracy is also nearly identical for both debate and double consultancy. This implies that, at least in our setting, the unique interactive component of debate is not driving the observed relationship.
In many ways, this is completely unsurprising. After all, we are only running one-round debates: there never could have been much interaction to begin with! Moreover, at test time, debate and double consultancy are identical for one-round debates, with the only difference being in the training objective. Our interpretation, then, is not that debate training is worthless, but that its value (to the extent there is one) will only materialize in longer debates with a greater chance for interaction or in models with greater capabilities. We believe this supports our conviction in the importance of running baselines that isolate the particular effects of debate, since naïve approaches can return positive results that do not actually support the use of debate specifically.
For the sake of comparison with Khan et al (2024), we also add a line for single consultancy, which is where the judge evaluates each speech in isolation, with an equal chance that the speech is defending the correct or the incorrect side. Unlike ensembled consultancy, the results are not averaged for each question. Like Khan et al (2024), we find that judges are less accurate when evaluating single consultancies. However, we differ in that we find a neutral-to-positive relationship between debater skill and judge accuracy, where they find a sharply negative relationship. We suspect that this difference arises because we deliberately trained the judge to reduce its sycophancy bias. A fuller explanation of the differences can be found in the Appendix.
Behavioral Metrics
In addition to evaluating the strength of alternative baseline methods, we also want to measure differences in the policies of each model so that we may get a sense as to whether the models’ behavior matches previous theory.
Evidence Counting Metrics
In a somewhat encouraging result, the DPO-trained models appear to offer longer and more accurate quotes. This is encouraging insofar as it suggests that, perhaps, the debate training is encouraging the model to exhibit behavior that we expect to be useful to an external judge.
Evidence Relevance Metrics
Our mechanism for detecting the relevance of different pieces of evidence returned more mixed results. Although GPT-4 was able to correctly classify the position implied by the DPO-trained model’s quotes more often than it was able to for the quotes from other models, this effect is too small to emphasize. Moreover, the absolute prediction accuracy (59-72%) is surprisingly low.
Distributional Metrics
We also ran our automated pipeline for detecting natural language differences between the different models’ behavior. Although many differences were statistically significant, with accuracies of 65-85%, few had broader implications. For example, it detected that the models trained on GPT-4T data had a more neutral tone than those trained exclusively on human data, which reveals more about our training data than the effect of the debate objective function. It also detected that the DPO training encouraged the consultant model to use excessively formal language (which was obvious from reading the transcripts). While interesting, this tells us little about whether the models are learning behaviors that are conducive to truth seeking.
The full set of significant differences between the models can be found in the Appendix.
Qualitative Assessment
Overall, it is promising that the judge appears to reward certain forms of behavior, such as faithful quoting with full context, that should correlate with accurate judging. However, we have failed to identify clear evidence of some of the strongest behavioral changes predicted by the more theoretical work on debate, such as a greater tendency to make easily-falsifiable claims during consultancy. We are still looking for ways to properly measure such a phenomenon.
Conclusion
Our plan for the coming weeks is to extend our methodology to cover multi-turn debates, where the hypothesized virtues of debate should hopefully materialize more clearly. While we work on this next step, we’d love to hear any thoughts you might have, but especially on the following questions:
Do you find the baselines convincing? Are there different or more difficult baselines that you think need to be tested?
What do you consider the most useful way to measure changes in model behavior or strategy?
Besides baselines and behavioral tests, are there any additional evaluation methods that you would need to see in order to be convinced of a conclusion?
Finally, we specifically designed our codebase to be as readable, replicable, and extensible as possible, so please reach out if you are looking to work on debate and want to build off of our existing scaffolding.
Thank you to the rest of NYU’s Alignment Research Group, Akbir Khan, Dan Valentine, John Hughes, and Rohit Dilip for feedback at various points during this effort.
NYU Debate Training Update: Methods, Baselines, Preliminary Results
[This writeup reflects work done jointly with David Rein and Julian Michael at NYU’s Alignment Research Group]
Introduction
In the past year, there have been a number of projects aimed at validating the basic premises behind debate as a mechanism for scalable oversight (see here, here, and here). One important next step would be to actually train models to debate, as this would let us directly test how models adapt to a debate training objective and whether the debate protocol can withstand optimization pressure. For the last few months at NYU, we’ve been trying to do just that. Our hope is that by doing so with open-sourced models and code, we can help support scalable oversight research being conducted outside of the major labs.
In this write-up, we wanted to share our experimental setup, training procedure, and some of our preliminary results, with the hope of receiving feedback on our approach before we extend our work to more complicated domains.
TL;DR
Like previous work, we are using a hidden-information setup with questions from the QuALITY dataset. For now, our preliminary results are limited to single-turn, simultaneous debates, however we are currently working on extending this to the more interesting multi-turn setting.
Also like previous work, we test the robustness of the debate protocol by measuring the judge’s accuracy when evaluating rounds between debaters of varying skill levels. However, we also introduce two new baselines (both extensions of consultancy from Michael et al., 2023) to more precisely isolate the impact of the debate protocol itself.
We construct debaters of differing skill levels by varying the model type, supervised finetuning data, best-of-N value, and use of Direct Preference Optimization.
We use a finetuned version of GPT-4-Turbo as our judge, after observing that open-source models were too error-prone and that the standard GPT-4-Turbo struggled to produce calibrated results.
Our preliminary experiments show a clear positive relationship between debater skill and judge accuracy, providing some early signal that judge accuracy improves as the debaters improve, despite the judge staying fixed. However, we also find that this same effect can be replicated in our baseline settings that lack debate’s distinctive attributes.
Motivation and Background
Brief Motivation for Debate as a Scalable Oversight Technique
As originally proposed by Irving et al (2018), debate has the potential to empower less capable systems (e.g. humans) to verify the outputs of more-capable systems (e.g. future AIs).
In its most basic form, debate works by having two copies of a model, the debaters, argue against each other, where the arguments are in defense of two alternative responses to a common question. A judge, who can be either a human or a weaker, trusted model, then tries to discern which debater is arguing for the correct answer.
In principle, this debate setup should simplify the job of the non-expert human or AI that is performing the evaluation. For many difficult questions, deep domain knowledge or careful thinking may be needed to spot the counterarguments or subtle flaws necessary to directly validate an expert AI’s answer. Debate incentivizes competing models to discover these flaws and counterarguments and then clearly explain them to the judge. As a result, we’d expect debate to make it harder for an AI to convince a non-expert of a blatantly false claim. The hope is that this property scales alongside the capabilities of the models as the ability of the models to construct persuasive but incorrect arguments is matched by the rebuttal ability of their opponents.
Moreover, models trained **to debate should also be more likely to output truthful or otherwise desirable responses. A model trained via RL to maximize their chance of winning should learn to generate those completions that are most heavily rewarded by the judge, which should, in theory, be speeches that are defending the correct side.
Central Questions
Greenblatt & Shlegeris (2024) distinguish between two different approaches to AI Safety: alignment methods that train models to avoid undesirable actions and control methods that identify and prevent catastrophic outcomes even when the model is unaligned. Since, in theory, debate can empower humans to identify when a model’s outputs may be unsafe while also incentivizing models to generate safer outputs to begin with, debate may function as both a control **and alignment method.
Different research questions may be more or less relevant depending on whether one focuses on debate as an alignment or control approach. For example, a researcher interested in debate as an alignment approach may be interested in whether a debate-trained model chooses the right answer when given a choice as to which side to defend or whether the debater models maintains some internal representation of truthfulness. By contrast, a researcher interested in debate as a method of control may expect judges to have much higher levels of accuracy, even in the presence of deliberately deceptive strategies.
We are choosing to focus on three questions that we believe to be of interest to both camps:
Can judges use debate transcripts to answer questions with high degrees of accuracy?
Is debate still useful even when the debaters are more sophisticated than the judge?
If so, what features of debate explain why?
We are not the first to try and answer any of these questions. Parrish et al (2022) looked at short 1-2 round debates and found that they did not help human judges evaluate questions more accurately. By contrast, Michael et al (2023) ran longer debates and found that judges were in fact more accurate than in alternative setups without adversarial debaters. Most recently, Radhakrishnan (2023) and Khan et al (2024) tested these hypotheses with language models as both the debaters and the judges, with Khan et al (2024) finding that the LLM judge improved when reviewing transcripts between stronger debaters.
Experimental Setup
Preliminary Framework
Debate Protocol
To determine whether judges can use debate transcripts to accurately identify correct answers, we need to define a common debate protocol that we can use to run experiments.
The procedure used by Michael et al (2023) involves the two debaters being assigned a position and then presenting their opening statements without seeing each other’s arguments. Subsequent rounds of the debate follow a sequential format, with the second debater being able to respond to the previous debater’s claims. Between each round, the judge has an opportunity to ask questions of the debaters. Radhakrishnan (2023) and Khan et al (2024) used mostly similar structures, with the former limiting themselves to a single speech for each side while the latter continued to have the models generate speeches simultaneously even after the first round.
For now, like Radhakrishnan (2023), we are limiting ourselves to single-turn simultaneous debates, in which each debater only gives a single speech in favor of their assigned position. Follow-up work will cover the more general, multi-turn case.
Types of Debatable Questions
To make these debates actually informative, we also need to find a set of questions that have a correct answer but that the judge cannot identify without additional assistance.
Parrish et al (2022) accomplished this task by using the QuALITY dataset from Pang et al (2022), which is a collection of short stories compiled with a set of associated multiple choice reading comprehension questions. To ensure that the judge needs the debate transcript to answer each question, Parrish et al (2022) limited the judge’s access to the short stories to only 90 seconds. Michael et al (2023) built on this setup by completely removing the judge’s ability to read the stories, an approach replicated by Radhakrishnan (2023) and Khan et al (2024). By withholding the text of the short story from the judge but not the debater, this setup artificially elevates the capabilities of the debaters relative to that of the judge: in this case, the debater’s expertise comes not from superior reasoning but from superior access to information.
Like Michael et al (2023), Radhakrishnan (2023), and Khan et al (2024), we will also be using a hidden-information setup on questions from the QuALITY dataset.
Evaluation Methods
Measuring Scalability
To determine how the ability of the judge to select correct answers scales alongside the abilities of the debaters, Khan et al (2024) propose running their experiment on a suite of models of differing capabilities. They measure each model’s skill level in a cross-play tournament, where each model is matched up against other models. They then measure the accuracy of the judge in a self-play tournament, where each model debates against a copy of itself. They expect that if debate really is a scalable method, then the accuracy of the judge should improve as the debaters grow stronger. To create models of varying skill levels, Khan et al (2024) varied the base model being used, including Claude 1, Claude 2, GPT-3.5, and GPT-4, in addition to employing different methods to improve speech quality, specifically Best-of-N selection and critique-and-refinement. Meanwhile Radhakrishnan (2023) opted to generate skill diversity by training a version of Claude 2 using RL and evaluating at different checkpoints, in addition to using models with different Best-of-N values.
We combine these approaches and experiment on models with different base model types, Best-of-N values, supervised finetuning data, and RL training.
Baseline Comparison
Introduced by Michael et al (2023), the “consultancy” baseline refers to having a single debater introduce arguments in favor of their assigned side with no opposing debater present, with the consultant being assigned the correct side half of the time. In their study of humans, Michael et al (2023) found that consultancy resulted in an error rate that was double that of debate and, in their study of API-based models, Khan et al (2024) found a negative relationship between consultant strength and judge accuracy, which is the opposite of the expected result for debate. This is intuitive, because competent dishonest consultants (i.e. consultants arguing for incorrect answers) have many degrees of freedom to make it difficult for the judge to detect their dishonesty.
We use the following two variants of this approach for our baselines, which we are calling ensembled and double consultancy. To avoid overloading terminology, we are going to use the term single consultancy to refer to the version of consultancy used by Michael et al (2023) and Khan et al (2024).
Ensembled Consultancy: In this version, we average together the results of two single consultancies where the consultant is arguing for different sides of the same question. As long the judge’s scores are not negatively correlated with the true answers, this should be a strictly stronger baseline than normal, single consultancy.
Double Consultancy: In this setup, we also train models to become competent consultants and have them generate transcripts as if no other debater is present. However, instead of the judge seeing the two arguments separately, as in ensembled consultancy, we show the judge both arguments, allowing the judge to explicitly compare the arguments. Note that this is the same evaluation method used for debate. The difference is in the training objectives: debaters are trained to optimize their win probability when the judge sees both arguments, whereas consultants are trained to optimize their win probability when the judge only sees a single argument at a time. If the theory behind debate holds, then we would expect that this second, non-adversarial training objective will not incentivize refutation or counterargument as effectively as debate.
Using both ensembled and double consultancy allows us to isolate the effects of the different mechanisms present in debate.
Ensembled Consultancy – Double Consultancy Gap: Previous work on questions without an expertise gap, such as Zheng et al (2023) and Liusie et al (2024), suggests that being able to see two responses at the same time can improve accuracy above and beyond what it is achievable when only seeing a single completion. The difference in performance between these two baselines captures the value that this pairwise comparison contributes to the success of the debate protocol. Since the models in both setups are trained and prompted in the same manner, any difference in outcomes should be attributable to the fact that the judge can see both arguments at the same time when making their judgement.
Double Consultancy – Debate Gap: The difference in performance between double consultancy and debate captures the importance of debater interaction. In both setups, the judge has access to speeches defending each possible answer to the question. The only difference is whether or not the debaters can engage with one another. Although the debaters will never directly refute each other in a one-round debate, this interactive effect is still implicitly captured in their reward signal, which rewards the model if they successfully convince the judge that they are more likely to be defending the correct side in a debate than their opponent. In principle, this could encourage debaters to successfully pre-empt or predict opponent arguments. By contrast, the consultancy training mechanism rewards speeches that are deemed correct when viewed independent of any opposing speeches.
Behavioral Metrics
Although these baselines can capture the relative effect of debater interaction, we are still limited in our ability to draw inferences about how the effect of debater interaction will scale to even more complex models.
The theory behind debate, as articulated in Irving et al (2018) and Michael et al (2023), makes specific predictions about the kinds of behavior that we hope to see exhibited by debate-trained models. For example, a debate-trained model should be unwilling to advance arguments that they anticipate will be defeated by an opponent argument or will be exposed as a lie. By contrast, a consultant-trained model should be more willing to advance these seemingly-plausible falsehoods. Similarly, we’d expect that a highly rated model should differ from its lower rated peers along multiple different axes of truthful behavior, including use of evidence, logical coherence, and claim falsifiability. If we see these improvements, then we have additional reason to believe that the debate protocol is rewarding truthful behavior and punishing untruthful behavior. Without a plausible account of the precise behavioral differences between the different trained models, it is difficult to draw conclusions about whether the debate mechanism is working as anticipated.
We use four different methods to detect behavioral differences between our various models.
Evidence Counting Metrics: To ground arguments in objective fact, the debaters and consultants are encouraged to quote from the hidden passage, with the exact quote being verified by an external tool. We track the length, frequency, and accuracy of these quotes. A quote is considered accurate if there is a corresponding section in the hidden passage that perfectly or nearly perfectly matches the quoted section (i.e. identical except for punctuation and conjugation).
Evidence LLM Classifier: To track the strength of the selected evidence, we extract the quotes from each speech and present them to GPT-4T, asking it to determine which position is best supported by the quotes. Models with a strong grasp of evidence should be more likely to have their quotes classified correctly, i.e., it should be obvious which answer their quote supports.
LLM Distribution Detection: We leverage Zhong et al (2022)’s mechanism for detecting differences in the distribution between two text corpora. At a high level, this involves prompting GPT-4T to identify differences between speeches on the same question from two different models, and then validating that such differences are meaningful by measuring GPT-4T’s ability to use that difference to correctly classify anonymized speech in the dataset.
Qualitative Assessment: Of course, these metrics and classifiers cannot detect every change in behavior. To supplement our analysis, we also read through all of the transcripts to identify any differences our automated analysis might have missed. This is especially useful for detecting certain high-level strategic behaviors, such as whether the incorrect debater is more likely to spout easily-checked falsehoods when trained via a non-debate training objective.
Training Methods
Debaters
The main goal of scalable oversight is to develop that methods that can oversee AI systems, even as they grow increasingly capable. To assess this, supervision performance (i.e. blinded judge accuracy, in our setting) should increase as the expert improves. We vary models along four different axes to generate a spread of model capabilities along which we can plot judge accuracy.
Model Type
We use two different open-source models as our base model. The first is a variant of instruction-finetuned Llama2-13B (Touvron 2023, Yukang 2023), that’s further finetuned to support a 32k context window. The second is Mixtral 8x7b. All models are 4bit quantized to fit on a single A100.
Supervised Finetuning Data Selection
The second way to generate skill differentiation is by varying the type of data we used to fine-tune the models.
Out of the box, many open-source models lack the basic capabilities needed to become a competent debater. In our experiments, both Llama 2 and Mixtral-8x7b were excessively verbose and often would exhaust their token limit before reaching their core arguments. When they did advance arguments, they were vague or generic. Based on their speech patterns, it seemed as if prompting the model to engage in a debate switched the models into a persona that mimicked that of a high school class debater, complete with all the verbal tics (e.g. beginning with “Good morning, judge”) and lack of sophistication that implies. Other, non-state-of-the-art closed source systems have similar issues. Radhakrishnan (2023) reports that Claude 2-based debaters exhibited excessive congeniality towards the opposing side while the transcripts from Khan et al (2024) reveal that GPT-3.5 struggles with the basic mechanics of citing evidence. We observed both of these problems with Llama 2 and Mixtral-8x7b as well.
As a result, we begin by fine-tuning our models with transcripts of human debaters collected by Michael et al (2023) at NYU. Due to hardware limitations, we opted for low-rank adaptation rather than a full fine-tune. Even though there were only 102 distinct transcripts in the dataset, we saw significant improvement in brevity, combativeness, and use of evidence from this finetuning.
We also supplement these human transcripts with those of GPT-4T debaters collected by Khan et al (2024), yielding a total of 969 transcripts. This has the benefit of reducing the model’s tendency to use slang, a behavior it appears to pick up from the human transcripts.
Best-of-N
The third way we generate skill differentiation is selecting speeches using the Best-of-N method. in this setup, instead of generating a single speech, the model generates N speeches and then uses some selection mechanism to choose the best out of those N speeches.
Like Radhakrishnan (2023) and Khan et al (2024), we use the real judge (a prompted version of GPT-4T) to select between speeches. However, unlike both of these works, we do not present each speech to the judge in isolation or next to a dummy speech. Instead, we have the model generate M simulated speeches for their opponent’s side. We then pair each of the N speeches with the M opponent speeches and present the pairs to the judge as if it were a debate transcript. Note that these opponent speeches are not the actual speeches used by the opponent in the round: these are speeches generated on behalf of the opponent’s side but they are not used in the actual round’s transcript. The main advantages and disadvantages of this approach are discussed in more detail in the appendix (to avoid cluttering this post, the appendix can be found here).
Direct Preference Optimization
Our final method for generating stronger models is to use direct preference optimization (Rafailov et al, 2023). DPO is a method that takes in pairs of completions to the same prompt and optimizes the model to increase the probability of generating the better completion of the pair relative to the probability of generating the worse one, while being constrained to stay close to its initial distribution. It has been widely used as a method for achieving many of the same benefits as RLHF. Full details of our DPO training are in the appendix.
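As a refresher on what that objective looks like, below is a minimal sketch of the standard DPO loss from Rafailov et al (2023), written in terms of summed per-token log probabilities. In practice we rely on an existing implementation; this is only meant to show the shape of the objective.

```python
# Per-pair DPO loss sketch: log-sigmoid of the scaled difference in policy-vs-reference
# log-ratios between the preferred (chosen) and dispreferred (rejected) completions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All inputs are summed token log-probabilities for whole completions."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Minimizing this pushes probability mass toward the chosen completion, while the
    # reference model (and a small beta) keeps the policy near its initial distribution.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```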
Results
In total, we have 4 different models, each with 4 different Best-of-N settings, for a total of 16 models with different skill levels.
“Llama-Human” — This is a 4-bit quantized Llama2-13b model finetuned using data only from the transcripts of human debaters collected by Michael et al (2023).
“Mixtral-Human” — This is a 4-bit quantized Mixtral-8x7b model, also finetuned using data only from the transcripts of human debaters. This differs from “Llama-Human” in terms of its base model choice.
“Mixtral-Combined” — This is a 4-bit quantized Mixtral-8x7b model, finetuned using data from both the human debaters and the GPT debaters created by Khan et al (2024). This differs from “Mixtral-Human” in terms of the data used for supervised finetuning.
“Mixtral-DPO” — This is the Mixtral-Combined model with a layer of DPO training added on top.
From here on out, any numerical suffix appended to a model name corresponds to its Best-of-N setting. In all cases, the number of simulated opponent speeches is 2.
Performance is measured based on how often each model won in a round-robin cross-play tournament. As expected, switching from Llama2-13b to the larger Mixtral-8x7b led to higher win rates, as did increasing the amount of data used in supervised fine-tuning and applying DPO. Increasing the Best-of-N value also led to higher scores for all values of N tested. Since some answers are easier to defend than others, we had each pair of models debate each question twice, flipping sides between rounds. The winner of the matchup was the model with the higher average judge confidence across both rounds. After this side-bias adjustment, the skill gap between the different models only grew wider.
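For clarity, here is a sketch of that side-bias-adjusted scoring rule; run_debate and judge_confidence are hypothetical stand-ins for our actual pipeline calls.

```python
# Sketch of the side-bias-adjusted scoring rule used in the cross-play tournament.
# run_debate and judge_confidence are hypothetical stand-ins for the real pipeline calls.
def adjudicate_matchup(model_a, model_b, question):
    """Debate the question twice with sides flipped; the winner is the model with the
    higher average judge confidence across both rounds."""
    # Round 1: model_a defends the first answer, model_b defends the second.
    first = judge_confidence(run_debate(model_a, model_b, question), for_model=model_a)
    # Round 2: the sides are flipped.
    second = judge_confidence(run_debate(model_b, model_a, question), for_model=model_a)
    average_confidence_a = (first + second) / 2
    return model_a if average_confidence_a > 0.5 else model_b
```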
Judging
Model Strength
Selecting an appropriate judge model is essential for any debate setup to work. In an ideal experimental setup, the judge model should be weaker than the debater models, so as to better approximate a future setting where the debater models are untrusted superhuman AIs and humans are the judges. However, there are a number of barriers to getting this setup to work in practice. In their debate experiments, Khan et al (2024) found that the accuracy of a judge using Llama2-70b was indistinguishable from random chance, with GPT-3.5’s accuracy only slightly better. They found that a GPT-4T-based judge yielded the strongest results, with accuracies of up to 80% on certain reading comprehension tests. Our more informal tests involving GPT-3.5, Llama2-13B, and Mixtral-8x7b with a trained linear probe returned similar results. We also observed no benefit from allowing the judge to use chain-of-thought reasoning.
It is still an open question whether the failure of weaker models to accurately judge rounds is a problem for debate as a mechanism for scalable oversight. Certainly, it is less than ideal that we have yet to observe any evidence that a weaker judge model can oversee a stronger debater model using debate. However, there is some suggestive evidence that this performance gap comes from the judge’s lack of absolute capabilities rather than the relative capability gap between the judge and the debaters. For example, Levy et al (2024) finds that the gap in performance between GPT-4T and Mixtral-8x7b on certain QA tasks more than doubled when the context length was extended to 1000 tokens, which is roughly the length of our debate transcripts, suggesting that non-state-of-the-art models may be weaker at the kind of long-context reasoning needed to judge.
Weaknesses of Out-of-the-Box Models
Although GPT-4T was the best performing judge model that we tried, it nonetheless has a few weaknesses that impede its ability to act as an ideal judge in an experiment.
It’s uncalibrated. We are interested not only in choosing a winner, but also in assigning a probability that each position is the correct answer. We tried two ways of generating these probabilities. First, we looked at the log probabilities of the output tokens associated with each of the debaters’ names. Unfortunately, as discussed earlier, the token probabilities were clustered near 0 and 1 and were not calibrated. Second, we prompted the model to output the probabilities as tokens (e.g. it would output “Winner: Debater_A | 70%” instead of “Winner: Debater_A”). However, we found that these stated probabilities were only loosely correlated with the actual win probabilities.
It’s sycophantic. As documented by a variety of existing works (e.g. Perez 2022), many instruction-tuned models default to agreeing with their user. Although this is not necessarily a problem for debate, where the user presents a transcript containing two different positions, this poses challenges for the construction of a non-adversarial baseline like consultancy. In their experiments, Khan et al (2024) found that the GPT-4T judge agreed with their strongest debaters 80-92% of the time in their version of the consultancy baseline.
It’s exploitable. Even setting aside jailbreaks, previous work by Wan et al (2024) and others suggests that language model judges are highly sensitive to certain linguistic features that have only limited argumentative value. For example, they found that language models were more likely to cite a piece of evidence if it used the exact phrasing found in the question than if it conveyed semantically equivalent content that also cited relevant scientific sources. Following their example, we experimented with appending a prefix and suffix to the speeches delivered by the debater arguing on behalf of the incorrect position. The most successful addenda simply restated the question along with their position (e.g. if the question was “How many children did the man have?” and the incorrect debater’s answer was “6”, we would begin and end its speeches with “The correct answer to the question ‘How many children did the man have’ is ‘6’”). A similar exploit involved appending a suffix that mimics the judge’s output format (e.g. the debater would end its speech with “Winner: Debater_A”). Both of these exploits increased the judge’s error rate by 10 percentage points.
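To make the attack concrete, here is a sketch of how the two addenda were constructed; the helper names are illustrative rather than taken from our codebase.

```python
# Sketch of the two addenda appended to the incorrect debater's speeches.
# Helper names are illustrative, not from our codebase.
def restated_answer_addendum(question, answer):
    # e.g. "The correct answer to the question 'How many children did the man have?' is '6'"
    return f"The correct answer to the question '{question}' is '{answer}'"

def verdict_mimicry_addendum(debater_name):
    # Mimics the judge's own output format, e.g. "Winner: Debater_A"
    return f"Winner: {debater_name}"

def apply_restatement_exploit(speech, question, answer):
    # The restatement is added as both a prefix and a suffix to the speech.
    addendum = restated_answer_addendum(question, answer)
    return f"{addendum}\n\n{speech}\n\n{addendum}"
```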
Trained Judge
To address these issues, we finetuned GPT-4T to act as the judge. Our training procedure was as follows:
Inputs: We combined the transcripts from Michael et al (2023)’s human debates with a small subset of Khan et al (2024)’s GPT debates such that there was an equal number of transcripts from each. To ensure that our baseline comparisons were fair, we had half our samples be in the format of our consultancy baseline. This resulted in a single, unified judge capable of judging both debates and our consultancy baseline.
Labels: For the human debates, we had the model output the winner as determined by the human judge, along with the explicit probability that the human judge assigned to the winning side after the first round (e.g. “Winner: Debater_A | 72%”). Importantly, this differs from Radhakrishnan (2023), who used gold labels for their judge training. We also experimented with using gold labels, but found that using the human labels led to more calibrated results, likely because the gold labels do not include confidences. For the GPT debates, Khan et al (2024) did not collect human judgments after one round, so we finetuned GPT-4T to output the human judgements rendered after three rounds of debate (even though GPT-4T is only shown the first round).
Training: The model was finetuned to predict the output labels using the OpenAI API for two epochs (a sketch of the data format and fine-tuning call is included below).
Extracting judgements at inference: Counterintuitively, we found that using the explicit probability output by the judge was not the most calibrated approach. Instead, reading the log probabilities of each winner’s name yielded a nearly perfect calibration curve. This was surprising: the uncertainty expressed in the winning probabilities of the training labels appears to have propagated into uncertainty over the name of the winner.
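For concreteness, here is a rough sketch of what a single judge-training example and the fine-tuning call might look like. The system prompt, file names, and model identifier are placeholders rather than our exact setup.

```python
# Sketch of the judge fine-tuning data format and API call (prompt text, file names,
# and model identifier are placeholders, not the exact configuration).
import json
from openai import OpenAI

client = OpenAI()

# One chat-formatted example per transcript, written to a JSONL training file.
example = {
    "messages": [
        {"role": "system", "content": "You are judging a debate. Name the winner and your confidence."},
        {"role": "user", "content": "<question, answer options, and first-round transcript>"},
        {"role": "assistant", "content": "Winner: Debater_A | 72%"},
    ]
}
with open("judge_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")  # ...plus the remaining examples

uploaded = client.files.create(file=open("judge_train.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(
    model="gpt-4-turbo",  # placeholder; fine-tuning access for GPT-4-class models is gated
    training_file=uploaded.id,
    hyperparameters={"n_epochs": 2},
)
```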
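And here is a rough sketch of reading the win probability off the winner-name token at inference. The fine-tuned model name is a placeholder, and the assumption that the debater names tokenize into a single “A”/“B” token is a simplification.

```python
# Sketch: derive the win probability from the log probabilities of the winner-name token.
# The model name is a placeholder, and the assumption that "Debater_A"/"Debater_B" surface
# as an "A"/"B" token is a simplification.
import math
from openai import OpenAI

client = OpenAI()
transcript_prompt = "<question, answer options, and debate transcript>"

response = client.chat.completions.create(
    model="ft:gpt-4-turbo:...",  # placeholder for the finetuned judge
    messages=[{"role": "user", "content": transcript_prompt}],
    logprobs=True,
    top_logprobs=5,
)

# Scan for the token position where the judge names the winner, then renormalize
# over the two candidate winners.
for token_info in response.choices[0].logprobs.content:
    alternatives = {alt.token: alt.logprob for alt in token_info.top_logprobs}
    if "A" in alternatives or "B" in alternatives:
        p_a = math.exp(alternatives.get("A", float("-inf")))
        p_b = math.exp(alternatives.get("B", float("-inf")))
        prob_a_wins = p_a / (p_a + p_b)
        break
```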
Results
As a result of training our judge, its overall accuracy on self-play debates collected from all model types and BoN values increased from 69% to 74%. Its calibration also improved, with its Brier score falling from 0.29, which is worse than guessing 0.5 every time (which would give a Brier score of 0.25), to 0.17. We also found that the finetuned judge was no longer vulnerable to the two suffix attacks that previously affected the non-finetuned model, with the performance gap dropping from 10 percentage points to 0. Finally, in the non-adversarial setting, the finetuned judge no longer expressed a preference for the consultant’s position, implying that the finetuning removed the sycophancy bias in this setting.
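For reference, the Brier score is just the mean squared error between the judge’s stated probability and the binary outcome, which is why always guessing 0.5 scores exactly 0.25:

```python
# Brier score: mean squared error between the predicted probability and the 0/1 outcome.
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Always guessing 0.5 gives (0.5)^2 = 0.25 on every round, regardless of the outcome,
# so 0.29 is worse than chance-level guessing while 0.17 is a real improvement.
assert brier_score([0.5, 0.5, 0.5], [1, 0, 1]) == 0.25
```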
Evaluation
Once the debaters and judge are trained, we can start using these models to validate whether the basic theory behind debate as a scalable oversight mechanism actually holds in practice.
Debater Skill—Judge Accuracy Relationship
The first question we ask is about how judge accuracy changes depending on the debaters’ skill level. If the theory behind debate holds, then there should be a positive relationship between the strength of the debaters and the accuracy of the judge.
Similar to Khan et al (2024), we do indeed observe a statistically significant but noisy positive relationship between debater strength and judge accuracy. The different points for each particular model correspond to varying N in BoN.
Comparison to Consultancy Baselines
However, to isolate the effect of debate itself, we also want to check whether the same relationship can be observed in settings that lack debate’s distinguishing features.
Overall, when we aggregate across all the different Best-of-N values, debate is more accurate than ensembled consultancy, the setting where the judge evaluates the speeches defending each side of the question in isolation before averaging the result. However, double consultancy, in which the judge gets to see the speeches defending each side of the question side-by-side, yields an overall accuracy equivalent to debate.
The difference in accuracy between debate and ensembled consultancy is significant at the p<0.01 level; the difference between debate and double consultancy is insignificant (p=0.38). When broken out by model type, the differences are insignificant for almost all comparisons. The error bars on the graph represent the 90 percent confidence interval.
Similarly, we notice that the relationship between debater strength and judge accuracy is nearly identical for debate and double consultancy. This implies that, at least in our setting, the unique interactive component of debate is not driving the observed relationship.
In many ways, this is completely unsurprising. After all, we are only running one-round debates: there never could have been much interaction to begin with! Moreover, at test time, debate and double consultancy are identical for one-round debates, with the only difference being in the training objective. Our interpretation, then, is not that debate training is worthless, but that its value (to the extent there is one) will only materialize in longer debates with a greater chance for interaction or in models with greater capabilities. We believe this underscores the importance of running baselines that isolate the particular effects of debate, since naïve approaches can return positive results that do not actually support the use of debate specifically.
For the sake of comparison with Khan et al (2024), we also add a line for single consultancy, where the judge evaluates each speech in isolation and there is an equal chance that the speech defends the correct or the incorrect side. Unlike ensembled consultancy, the results are not averaged for each question. Like Khan et al (2024), we find that judges are less accurate when evaluating single consultancies; however, we differ in that we find a neutral-to-positive relationship between debater skill and judge accuracy, where they find a sharply negative relationship. We suspect that this difference arises because we deliberately trained the judge to reduce its sycophancy bias. A fuller explanation of the differences can be found in the Appendix.
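To make the three baselines concrete, here is a sketch of how each one queries the judge; judge_probability is a hypothetical helper that returns the judge’s probability that answer A is correct given whichever speeches it is shown.

```python
# Sketch of the three consultancy baselines. judge_probability is a hypothetical helper
# that returns the judge's probability that answer A is correct, given the speeches shown.
import random

def single_consultancy(question, speech_a, speech_b):
    # The judge sees one speech in isolation, chosen with equal probability from either side.
    speech = random.choice([speech_a, speech_b])
    return judge_probability(question, speeches=[speech])

def ensembled_consultancy(question, speech_a, speech_b):
    # Judge each side's speech in isolation, then average the two judgments.
    p_given_a = judge_probability(question, speeches=[speech_a])
    p_given_b = judge_probability(question, speeches=[speech_b])
    return (p_given_a + p_given_b) / 2

def double_consultancy(question, speech_a, speech_b):
    # The judge sees both speeches side by side in one transcript, but the speeches were
    # generated independently, so there is no interaction between them.
    return judge_probability(question, speeches=[speech_a, speech_b])
```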
Behavioral Metrics
In addition to evaluating the strength of alternative baseline methods, we also want to measure differences in each model’s policy so that we can get a sense of whether the models’ behavior matches previous theory.
Evidence Counting Metrics
The DPO-trained models appear to offer longer and more accurate quotes. This is encouraging insofar as it suggests that the debate training may be pushing the model toward behavior that is useful to an external judge.
Evidence Relevance Metrics
Our mechanism for detecting the relevance of different pieces of evidence returned more mixed results. Although GPT-4 was able to correctly classify the position implied by the DPO-trained model’s quotes more often than for the quotes from other models, the effect is too small to emphasize. Moreover, the absolute prediction accuracy (59-72%) is surprisingly low.
Distributional Metrics
We also ran our automated pipeline for detecting natural language differences between the different models’ behavior. Although many differences were statistically significant, with accuracies of 65-85%, few had broader implications. For example, it detected that the models trained on GPT-4T data had a more neutral tone than those trained exclusively on human data, which reveals more about our training data than the effect of the debate objective function. It also detected that the DPO training encouraged the consultant model to use excessively formal language (which was obvious from reading the transcripts). While interesting, this tells us little about whether the models are learning behaviors that are conducive to truth seeking.
The full set of significant differences between the models can be found in the Appendix.
Qualitative Assessment
Overall, it is promising that the judge appears to reward certain forms of behavior, such as faithful quoting with full context, that should correlate with accurate judging. However, we have not yet identified clear evidence of some of the strongest behavioral changes predicted by the more theoretical work on debate, such as the tendency to make easily falsifiable claims during consultancy. We are still looking for ways to properly measure such a phenomenon.
Conclusion
Our plan for the coming weeks is to extend our methodology to cover multi-turn debates, where the hypothesized virtues of debate should hopefully materialize more clearly. While we work on this next step, we’d love to hear any thoughts you might have, especially on the following questions:
Do you find the baselines convincing? Are there different or more difficult baselines that you think need to be tested?
What do you consider the most useful way to measure changes in model behavior or strategy?
Besides baselines and behavioral tests, are there any additional evaluation methods that you would need to see in order to be convinced of a conclusion?
Finally, we specifically designed our codebase to be as readable, replicable, and extensible as possible, so please reach out if you are looking into working on debate and want to build off of our existing scaffolding.
Thank you to the rest of NYU’s Alignment Research Group, Akbir Khan, Dan Valentine, John Hughes, and Rohit Dilip for feedback at various points during this effort.