I like this paper, but I think the abstract is somewhat overstated. In particular, instead of:
We find that debate consistently helps both non-expert models and humans answer questions,
I wish this was something more like:
On the QuALITY dataset and in the case where debaters are given more knowledge than otherwise similar judges, we find that debate consistently helps both non-expert models and humans answer questions,
My complaints are:
Hidden passage debate on QuALITY is actually pretty narrow as far as domains go and might have pretty different properties from future cases. For instance, it seems to effectively assume that AIs and judges are equally smart and just have a knowledge disparity. This might be the future situation, but this is far from certain.
My understanding is that there are a bunch of negative results on other domains and perhaps on other variants of the QuALITY task. So readers should interpret these results as “we found a case where debate works well”, not “we picked an interesting task and debate worked”. In particular, a pretty natural domain is something like: physics questions where the judge is a notably weaker model and a baseline is just getting the judge to solve the problem itself.
To be clear, the first of those two complaints is discussed in the conclusion/discussion/limitation section; I just wish it was also touched on in the abstract. (It might also be nice to mention tentative negative results somewhere in the body; I don’t think I see this anywhere.)
I think the choices made in this paper are probably reasonable, and the fact that debate doesn’t yet work for AIs in non-asymmetric information cases (e.g. without needing a hidden passage) is probably due to not having smart enough AIs. That is:
We would want to run a debate where there isn’t any directly hidden information and we just have a weaker judge.
Then, in this setting, we could use as a baseline “just ask the judge” (which is a baseline we can use in practice with human judges).
But, it currently seems like only the very most powerful AIs are smart enough to be able to constructively judge debates. So, if we (e.g.) tried to use GPT-3.5 as a judge it would be too dumb, and if we used Claude or GPT-4 as a judge it would be too close in intelligence to the level of the debaters to measure much effect. (A rough sketch of the setup I have in mind follows below.)
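Concretely, something like the following (illustrative Python only; the `ask` callable, the prompts, and the two-debater framing are placeholders I’m making up, not anything from the paper):

```python
from typing import Callable, List, Tuple

def zero_shot_baseline(ask: Callable[[str, str], str], judge: str,
                       question: str, options: List[str]) -> str:
    # Baseline: just ask the weaker judge directly, with no debate at all.
    prompt = (f"Question: {question}\nOptions: {options}\n"
              "Answer with the single best option.")
    return ask(judge, prompt)

def symmetric_debate(ask: Callable[[str, str], str], judge: str, debater: str,
                     question: str, options: List[str], rounds: int = 2) -> str:
    # No hidden passage: debaters and judge see exactly the same question.
    # The only asymmetry is that the judge is a weaker model than the debaters.
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        for option in options[:2]:  # one debater argues for each candidate answer
            argument = ask(debater,
                           f"Argue that '{option}' is the correct answer to: {question}\n"
                           f"Transcript so far: {transcript}")
            transcript.append((option, argument))
    return ask(judge,
               f"Question: {question}\nOptions: {options}\n"
               f"Debate transcript: {transcript}\n"
               "Which option is correct?")
```

The comparison I’d want is accuracy of symmetric_debate vs. zero_shot_baseline across a dataset, with the judge a meaningfully weaker model than the debaters.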
I’d be interested in debate results where we have human debaters and GPT-4 as a judge. (Unless this is already in this paper? I don’t see it, but I haven’t read the results in detail yet.) I think this seems somewhat analogous to the case where we have AI debaters and human judges (judge and debaters have different capability profiles, debaters might understand a bunch of judge weaknesses, etc.).
Thanks for the feedback Ryan!

I like this paper, but I think the abstract is somewhat overstated.
This is good to know. We were trying to present an accurate summary in the abstract while keeping it concise, which is a tricky balance. Seems like we didn’t do a good enough job here, so we’ll update the abstract to caveat the results a bit more.
Hidden passage debate on QuALITY is actually pretty narrow as far as domains go and might have pretty different properties from future cases.
Yep, agreed! QuALITY is a great testbed for debate, but we definitely need to see debate results in other domains. The NYU ARG stream in MATS is looking at some other LLM debate domains right now and I’m very keen to see their results.
My understanding is that there are a bunch of negative results on other domains and perhaps on other variants of the QuALITY task.
Yeah, we tried a bunch of other tasks early on, which we discuss in Appendix C. Originally we were using debate with symmetric information to try to push judge performance on various datasets above the judge’s 0-shot performance. This didn’t work for a few reasons:
As you mentioned, it seems like GPT-4 class models are the minimum capability level needed to be a reasonable judge. You can see this in Figure 1 of the paper—for the GPT-4-Turbo judge, debate massively beats baselines. For Claude-2.1, debate only slightly helps, and for GPT-3.5, there’s no clear signal. We tried judges weaker than GPT-4 a bunch and didn’t get anywhere with them.
Using GPT-4 as both debater and judge without information asymmetry (which we call the “self-improvement” setting) seemed pretty hard—I think adding the debate transcript adds a lot of noise to the judge’s decision making, which mostly degrades performance in cases where 0-shot is already very high. In cases where 0-shot performance is poor, that also means the debaters lack the capability to present valid arguments.
It still seems plausible to me that with the right prompting, best-of-N sampling, or other scaffolding, it would be possible to use debate effectively here. We also didn’t do any training at all. I don’t think our negative results are very strong evidence (we just tried for a few weeks before moving to QuALITY).
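To be concrete about what I mean by best-of-N: something like the sketch below, where sample draws candidate arguments for a single debate turn and score is some persuasiveness rating (e.g. from a preference model). Both helpers are placeholders, not code from our actual pipeline.

```python
from typing import Callable, List

def best_of_n_argument(sample: Callable[[str], str],
                       score: Callable[[str], float],
                       turn_prompt: str, n: int = 8) -> str:
    # Draw n candidate arguments for the same debate turn and keep whichever
    # one the scoring function rates as most persuasive.
    candidates: List[str] = [sample(turn_prompt) for _ in range(n)]
    return max(candidates, key=score)
```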
I don’t think the usefulness of debate for oversight depends on whether debate for self-improvement works. Getting good accuracy with a weaker judge seems much more on target for the kind of thing we want to use debate for. I think hidden information is a pretty good setup for this now, but when GPT-5 comes out I would like to see symmetric-information debates run with a GPT-4 judge. If that doesn’t improve on GPT-4’s 0-shot accuracy on tasks where there is a big GPT-4 → GPT-5 0-shot gap, that would be an update against debate for me.
Our headline result, that judge accuracy scales with debater Elo, was only tested on QuALITY. I’d be pretty interested if someone tried to replicate that finding on other datasets. Even without beating 0-shot, it would be nice to see that trend.
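If anyone does try this on another dataset, one crude way to turn cross-play judge verdicts into Elo-style debater ratings is something like the sketch below. This is a generic iterated Elo fit, not necessarily the exact aggregation we used:

```python
from typing import Dict, List, Tuple

def fit_elo(matches: List[Tuple[str, str]], k: float = 16.0,
            epochs: int = 100) -> Dict[str, float]:
    # matches: (winner, loser) debater pairs, where the "winner" is whichever
    # debater's answer the judge picked in that debate.
    ratings: Dict[str, float] = {}
    for winner, loser in matches:
        ratings.setdefault(winner, 1000.0)
        ratings.setdefault(loser, 1000.0)
    for _ in range(epochs):
        for winner, loser in matches:
            # Standard Elo update: expected score of the winner, then adjust.
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            ratings[winner] += k * (1.0 - expected)
            ratings[loser] -= k * (1.0 - expected)
    return ratings
```

The trend to look for is whether judge accuracy (how often the debater defending the correct answer wins) increases with the Elo of the debaters in a matchup.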
I’d be interested in debate results where we have human debaters and GPT-4 as a judge. (Unless this is already in this paper? I don’t see it, but I haven’t read the results in detail yet.) I think this seems somewhat analogous to the case where we have AI debaters and human judges (judge and debaters have different capability profiles, debaters might understand a bunch of judge weaknesses, etc.).
So we did check something similar—we ran our GPT-4 judge on the human debate transcripts from Michael et al. We found that debate accuracy was higher than consultancy accuracy, and also that the inter-annotator agreement between human and GPT-4 judges was much higher in debate than in consultancy. These results didn’t make it into the paper, but they may be worth adding to an appendix. Of course this is not the same as human debaters who know their judge will be an LLM—in that case I’d imagine debaters trying out a lot of weird adversarial strategies. I think I wouldn’t be surprised if such strategies worked to the point where our persuasiveness → judge accuracy relationship broke down, but I don’t think it would be a big update against debate for me—current LLMs are just very vulnerable to weird attacks compared to humans.
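For anyone computing this kind of agreement themselves, here is a minimal sketch of Cohen’s kappa, one standard chance-corrected agreement measure (not necessarily the exact statistic we used):

```python
from collections import Counter
from typing import Hashable, List

def cohens_kappa(judge_a: List[Hashable], judge_b: List[Hashable]) -> float:
    # Chance-corrected agreement between two judges' verdicts on the same items.
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    expected = sum(counts_a[v] * counts_b[v] for v in counts_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```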
Thanks for the response!

I think I agree with everything you said and I appreciate the level of thoughtfulness.
Yeah we tried a bunch of other tasks early on, which we discuss in Appendix C.
Great! I appreciate the inclusion of negative results here.
Of course this is not the same as human debaters who know their judge will be an LLM—in that case I’d imagine debaters trying out a lot of weird adversarial strategies.
Yep, I’d be interested in this setup, but maybe where we ban egregious jailbreaks or similar.