“More persuasive” here means a higher win rate in debate, which I think is the same thing it would mean in any debate context? I agree the limitation to inference time rather than training is definitely important to keep in mind. I think that best-of-N using the judge as a preference model is a reasonable approximation of moderate amounts of RL training, but doing actual training would allow us to apply a lot more optimization pressure and get a wider spread of Elos. There has been some good debate RL work done in a similar setting here, and I’d love to see more research done with debate-trained models.
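To make the inference-time setup concrete, here's a rough sketch of what I mean by best-of-N with the judge used as a preference model (purely illustrative, not our actual pipeline; `sample_arguments` and `judge_score` are hypothetical stand-ins for the debater and judge model calls):

```python
from typing import Callable, List

def best_of_n(
    sample_arguments: Callable[[str, int], List[str]],  # hypothetical: draw N candidate arguments from the debater model
    judge_score: Callable[[str], float],                 # hypothetical: judge-as-preference-model score for one argument
    prompt: str,
    n: int,
) -> str:
    """Pick the candidate argument the judge rates as most persuasive."""
    candidates = sample_arguments(prompt, n)
    return max(candidates, key=judge_score)

# Varying n gives debaters of different strength purely at inference time,
# e.g. a weak debater with n=1 (no selection) vs. a stronger one with n=16.
```

Actual RL training would replace this selection step with gradient updates against the same preference signal, which is where the extra optimization pressure (and wider Elo spread) would come from.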
Dan Valentine
Thanks for the feedback, Ryan!
I like this paper, but I think the abstract is somewhat overstated.
This is good to know. We were trying to present an accurate summary in the abstract while keeping it concise, which is a tricky balance. Seems like we didn’t do a good enough job here, so we’ll update the abstract to caveat the results a bit more.
Hidden passage debate on QuALITY is actually pretty narrow as far as domains go and might have pretty different properties from future cases.
Yep, agreed! QuALITY is a great testbed for debate, but we definitely need to see debate results in other domains. The NYU ARG stream in MATS is looking at some other LLM debate domains right now and I’m very keen to see their results.
My understanding is that there are a bunch of negative results on other domains and perhaps on other variants of the QuALITY task.
Yeah we tried a bunch of other tasks early on, which we discuss in Appendix C. Originally we were using debate with symmetric information to try to improve judge performance on various datasets above their 0-shot performance. This didn’t work for a few reasons:
First, as you mentioned, it seems like GPT-4-class models are the minimum capability level needed to be a reasonable judge. You can see this in Figure 1 of the paper: for the GPT-4-Turbo judge, debate massively beats the baselines; for Claude-2.1, debate only slightly helps; and for GPT-3.5, there's no clear signal. We tried a bunch of judges weaker than GPT-4 and didn't get anywhere with them.
Second, using GPT-4 as both debater and judge without information asymmetry (which we call the “self-improvement” setting) seemed pretty hard. I think adding the debate transcript adds a lot of noise to the judge's decision-making, which mostly degrades performance in cases where 0-shot accuracy is already very high. And in cases where 0-shot performance is poor, that usually means the debaters also lack the capability to present valid arguments.
It still seems plausible to me that with the right prompting/best-of-N/other scaffolding, it would be possible to use debate effectively here. We also didn’t do any training at all. I don’t think our negative results are very strong evidence (we just tried for a few weeks before moving to QuALITY).
I don’t think the usefulness of debate for oversight depends on whether debate for self-improvement works. Getting good accuracy with a weaker judge seems much more on target for the kind of thing we want to use debate for. I think hidden information is a pretty good setup for this now, but when GPT-5 comes out I would like to see symmetric information debates run with a GPT-4 judge. If that doesn’t improve GPT-4 0-shot on tasks where there is a big 4->5 0-shot gap, that would be an update against debate for me.
Our headline result, that judge accuracy scales with debater Elo, was only tested in QuALITY. I’d be pretty interested if someone tried to replicate that finding on other datasets. Even without beating 0-shot, it would be nice to see that trend.
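If anyone does attempt that replication, fitting Elos from judge verdicts is straightforward. Here's one simple way to do it (an illustrative sketch of my own, not the paper's code; a Bradley-Terry maximum-likelihood fit would be the more principled choice):

```python
from typing import Dict, List, Tuple

def fit_elo(matches: List[Tuple[str, str, float]],
            k: float = 16.0, epochs: int = 200) -> Dict[str, float]:
    """Fit Elo ratings from (debater_a, debater_b, score_a) records, where
    score_a is 1.0 if the judge sided with A and 0.0 if it sided with B.
    Uses simple iterative K-factor updates over repeated passes."""
    ratings: Dict[str, float] = {}
    for a, b, _ in matches:
        ratings.setdefault(a, 1000.0)
        ratings.setdefault(b, 1000.0)
    for _ in range(epochs):
        for a, b, score_a in matches:
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score_a - expected_a)
            ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings
```

You'd then check whether judge accuracy increases with the fitted ratings of the debaters it was paired with.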
I’d be interested in debate results where we have human debaters and GPT-4 as a judge. (Unless this is already in this paper? I don’t see it, but I haven’t read the results in detail yet.) I think this seems somewhat analogous to the case where we have AI debaters and human judges (judge and debaters have different capability profiles, debaters might understand a bunch of judge weaknesses, etc.).
So we did check something similar—we ran our GPT-4 judge on the human debate transcripts from Michael et al. We found that debate accuracy was higher than consultancy, and also that the inter-annotator agreement between human and GPT-4 judges was much higher in debate than in consultancy. These results didn’t make it into the paper, but maybe are worth adding to an appendix. Of course this is not the same as human debaters who know their judge will be an LLM—in that case I’d imagine debaters trying out a lot of weird adversarial strategies. I think I wouldn’t be surprised if such strategies worked to the point where our persuasiveness → judge accuracy relationship broke down, but I don’t think it would be a big update against debate for me—current LLMs are just very vulnerable to weird attacks compared to humans.
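To spell out what I mean by inter-annotator agreement: how often the human judge and the GPT-4 judge return the same verdict, ideally corrected for chance. A chance-corrected statistic like Cohen's kappa is the kind of measure involved (sketch below is purely illustrative; I'm not pinning down the exact statistic here):

```python
from typing import List

def cohens_kappa(judge_a: List[int], judge_b: List[int]) -> float:
    """Chance-corrected agreement between two judges' binary verdicts (0/1)."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # fraction of '1' verdicts from judge A
    p_b = sum(judge_b) / n  # fraction of '1' verdicts from judge B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```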
Debating with More Persuasive LLMs Leads to More Truthful Answers
Seems weird for this to be the same time and date as the Toronto meetup. Lots of people who might have been interested in going will probably be at the one in Toronto instead.
Understanding mesa-optimization using toy models
The bottleneck in this scenario becomes brain health, as receiving a brain transplant is not very useful. I’m not sure how much of an obstacle this will be in practice.
For a high-level look at quantum physics, I’d recommend Something Deeply Hidden by Sean Carroll. I feel like I understand many-worlds much better after reading it. If you like audiobooks, this one is great too.
[workshop] Detecting out of distribution data
My employer isn’t gonna allow me to take a couple months off to go do this thing I personally am very interested in
Have you considered asking them about it? I’ve worked at several software jobs where this would have been no problem. I’ve also seen a few people take sabbaticals without any issue; their teammates generally thought it was really cool. One guy I know took a 1-year sabbatical to live in a van and drive around Europe.
This is all anecdotal, and your situation may be different, of course. I just wanted to add this data point because it seemed like you might be prematurely dismissing sabbaticals as some crazy thing that never happens in real life.
The worst part is, for most of these, time lost is gone forever. It’s just a slowdown. Like the Thai floods simply permanently set back hard drive progress and made them expensive for a long time, there was never any ‘catchup growth’ or ‘overhang’ from it.
Isn’t this great news for AI safety, since it gives us longer timelines?
I found your earlier comment in this thread insightful and I think it would be really valuable to know what evidence convinced you of these timelines. If you don’t have time to summarize in a post, is there anything you could link to?
How long do you expect the event to last? I’d love to join, but this week I’ll have to leave after the first hour.
[Online] EA Toronto Monthly Social
SSC Dublin Meetup
Update: Black Sheep is fully booked tomorrow, so the location has changed to Kimchi Hophouse!
Declarative and procedural knowledge rely on two different memory systems. Spaced repetition is good for declarative knowledge, but for procedural knowledge (like playing music) you need lots of practice. Math and programming are other examples: you can learn lots of declarative knowledge about the concepts involved, but you still need to practice solving problems or writing code.
Edit: as for why to practice every day, it’s because the procedural system requires a lot more practice than the declarative system does.