A Small Negative Result on Debate
Some context for this new arXiv paper from my group at NYU:
We’re working toward sandwiching experiments using our QuALITY long-document QA dataset, with reading time playing the role of the expertise variable. Roughly: Is there some way to get humans to reliably answer hard reading-comprehension questions about a ~5k-word text, without ever having the participants or any other annotators take the ~20 minutes it would require to actually read the text?
This is an early writeup of some negative results. It’s earlier in the project than I would usually write something like this up, but some authors had constraints that made it worthwhile, so I’m sharing what we have.
Here, we tried to find out whether single-turn debate leads to reliable question answering: If we give people high-quality arguments for and against each (multiple-choice) answer choice, supported by pointers to key quotes in the source text, can they reliably answer the questions under a time limit?
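To make the setup concrete, here is a minimal sketch of the single-turn protocol in Python. The class and field names are hypothetical illustrations, not the paper’s actual schema, and `score_fn` stands in for the time-limited human judge:

```python
from dataclasses import dataclass

# Hypothetical data model for the single-turn debate setup described above.
@dataclass
class AnswerCase:
    answer: str       # one multiple-choice option
    argument: str     # a persuasive argument for this option
    quotes: list      # pointers to supporting quotes from the source text

@dataclass
class DebateItem:
    question: str
    cases: list       # one AnswerCase per answer choice

def judge(item: DebateItem, score_fn) -> str:
    """Pick the answer whose (argument, quotes) pair the judge scores highest.

    In the experiment, the "scoring" is done by a human under a time limit;
    here score_fn is just a placeholder for that judgment.
    """
    best = max(item.cases, key=lambda c: score_fn(c.argument, c.quotes))
    return best.answer
```

For example, with a toy judge that simply prefers longer arguments, `judge` returns whichever answer choice came with the longer pitch; the experimental question is whether real, time-limited human judges do any better than that.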
We did this initial experiment in an oracle setting: we had (well-incentivized, skilled) humans write the arguments, rather than an LM. Given the limits of current LMs on long texts, we expect this to give us more information about whether this research direction is going anywhere.
It didn’t really work: Our human annotators answered at the same low accuracy with and without the arguments. The selected pointers to key quotes did help a bit, though.
We’re planning to keep pursuing the general strategy, with multi-turn debate—where debaters can rebut one another’s arguments and evidence—as the immediate next step.
Overall, I take this as a very slight update in the direction that debate is difficult to use in practice as an alignment strategy. Slight enough that this probably shouldn’t change your view of debate unless you were, for some reason, interested in this exact constrained/trivial application of it.
Crossposting my comments from a Slack thread:
Here are some debate trees from experiments I did on long-text QA on this example short story:
Tree
Debater view 1
Debater view 2
Our conclusion was that we don’t expect debate to work robustly in these cases. In our case this was mostly because, when the debate is over something like ‘is there implied subtext A?’, human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate) and various updates to that based on style, word choice, etc., where humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion.

My guess is that’s not the limitation you’re running into here; I’d expect that to just be the depth.
There are other issues with text debates, like when the evidence is distributed across many quotes that each provide only a small amount of evidence. In this case the honest debater needs decent estimates of how much evidence each quote provides, so they can split their argument into something like ‘there are 10 quotes that weakly support position A’ and ‘the evidence these quotes provide is additive rather than redundant’.
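The additive-vs-redundant distinction can be made concrete with a toy log-odds calculation (a naive-Bayes sketch of my own, not anything from the paper): if the quotes are conditionally independent, each one’s likelihood ratio contributes its full log-odds increment; if they are fully redundant, only one copy should count.

```python
import math

def combined_posterior(prior: float, likelihood_ratios: list, independent: bool = True) -> float:
    """Combine pieces of evidence in log-odds space.

    If the quotes are independent, each likelihood ratio adds its full
    log-odds contribution; if fully redundant, only one copy counts.
    """
    log_odds = math.log(prior / (1 - prior))
    ratios = likelihood_ratios if independent else likelihood_ratios[:1]
    log_odds += sum(math.log(r) for r in ratios)
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Ten quotes, each only weakly favoring position A (likelihood ratio 1.5):
weak = [1.5] * 10
print(round(combined_posterior(0.5, weak, independent=True), 3))   # ~0.983: strong support if independent
print(round(combined_posterior(0.5, weak, independent=False), 3))  # 0.6: barely moves if redundant
```

The gap between those two numbers is exactly what the honest debater has to argue about: not just that the ten quotes exist, but that their contributions add up rather than restating one another.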
[edited to fix links]
Yep. (Thanks for re-posting.) We’re pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases—we’re mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).
Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a quick way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative.
Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016
Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)
We’re still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited setting; we’re doing a broader, more fail-fast search of other settings. Stay tuned for more methods and datasets.
I think that one of the key difficulties for debate research is having good tasks that call for more sophisticated protocols. I think this dataset seems great for that purpose, and having established a negative result for 1-turn debate seems like a good foundation for follow-up work exploring more sophisticated protocols. (It seems like a shame that people don’t normally publish early-stage and negative results.)
In comparison with other datasets (e.g. in the negative results described by Beth), it seems like QuALITY is identifying pretty crisp failures and is within striking distance for modern ML. I haven’t looked at the dataset beyond the samples in the paper but tentatively I’m pretty excited about more people working on it (and excited to see future work from your group!)
I do strongly suspect that multi-turn debates could handle these questions, and if not it would be a pretty significant update about debate / the nature of human reasoning / etc. I think it’s possible those debates would have to get pretty complicated, and it’s also quite plausible that it will be easier to get something else to work. In any case, I feel like the problem is a close enough match for what we care about that doing “whatever it takes” will probably generally be pretty interesting.
Do you have suggestions for domains where you do expect one-turn debate to work well, now that you’ve got these results?
I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it’s possible to enumerate a complete argument in a way that makes any flaws clearly apparent.
In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don’t quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.
If there are high-quality arguments for multiple answers, doesn’t that “just” mean that the multiple-choice question is itself low-quality?
One of the arguments is quite misleading in most cases, so probably not high-quality by typical definitions. Unfortunately, under the time limit, our readers can’t reliably tell which one is misleading.
Without arguments and without the time limit, annotators get the questions right with ~90% accuracy: https://arxiv.org/abs/2112.08608
Did your description to the participants state that the arguments were high-quality?
I can look up the exact wording if it’s helpful, but I assume it’s clear from the basic setup that at least one of the arguments has to be misleading.
I don’t know about anyone else; under time pressure I personally would go about looking for ‘one wrong argument in a sea of high-quality arguments’ very differently than I would go about looking for ‘one misleading but superficially high-quality argument in a sea of high-quality arguments’ or ‘one very-high-quality argument in a sea of high-quality arguments’.