Thank you for posting this! I think it’s well worth thinking through debate baselines more.
TL;DR:
I think your open consultancy idea isn’t specific to consultancy. Everything about it applies equally well to debate, and it basically corresponds to how the RL training should work in practice.
The 50⁄50 assumption in our consultancy and debate experiments was made for reasons tied to our research question of interest, which is measuring the relative utility of the supervision paradigms. I don’t think we would want to directly use consultancy with a 50⁄50 prior to help us answer questions in practice, and that wasn’t the intent behind using it as a baseline (speaking at least for myself).
In more detail:
The way debate training should work in practice is that the reward signal from winning the debate feeds back into the policy that decides the answer. So a model trained using debate should also choose the answer it thinks will be most convincing to the judge. It’ll then debate that against the next most likely(-to-win) answer. You can show these predicted win probabilities to the judge, and it would be much the same as showing the top two answers (and their win probabilities) to the judge in consultancy, then arguing for the one assigned higher probability. This seems to me like the debate analog of open consultancy, and IMO it’s what should be compared.
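To make that concrete, here’s a minimal sketch of what one round of this ‘open debate’ setup might look like. This is purely illustrative: `policy`, `debater`, and `judge` (and their methods) are assumed interfaces, not any existing codebase.

```python
# Hypothetical sketch of one 'open debate' round. All interfaces here
# (policy, debater, judge) are assumptions for illustration.

def open_debate_round(question, candidate_answers, policy, debater, judge):
    # The policy predicts, for each candidate answer, its probability
    # of winning a debate in front of the judge.
    win_probs = {a: policy.predicted_win_prob(question, a)
                 for a in candidate_answers}
    # Debate the most convincing answer against the runner-up.
    best, runner_up = sorted(win_probs, key=win_probs.get, reverse=True)[:2]
    transcript = debater.run(question, pro=best, con=runner_up)
    # The judge sees the predicted win probabilities alongside the
    # transcript, analogous to seeing the consultant's stated confidence
    # in open consultancy.
    verdict = judge.decide(question, transcript, priors=win_probs)
    return verdict  # the win/loss signal feeds back into the policy
```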
It seems to me that doing CoT before deciding the answer is also possible in debate. You can just sample from or search/marginalize over the CoT distribution to get a distribution over answers and then use that; the reward signal can feed back into this in the same way as well.
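For instance (again just a sketch; `sample_cot` and `final_answer` are hypothetical methods, not a real API), sampling CoTs and marginalizing over the answers they land on gives you the answer distribution:

```python
import collections

def answer_distribution(question, model, n_samples=32):
    # Sample chains of thought and count which final answer each yields.
    counts = collections.Counter()
    for _ in range(n_samples):
        cot = model.sample_cot(question)
        counts[model.final_answer(question, cot)] += 1
    # Empirical marginal over answers; the debate reward can be pushed
    # back through the CoT sampler just as it is through the policy.
    return {ans: c / n_samples for ans, c in counts.items()}
```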
So why did we use the 50⁄50 assumption for consultancy?
The main reason is that we want the final judge accuracies to be comparable. Suppose you’re comparing debate (with no prior over the answers) to an RLHF’d model that you’re pretty sure gets 90% accuracy. Then you could just completely defer to the model and get 90% accuracy while not supervising it at all, whereas you’d have no way of doing this with debate. A fairer comparison would be against a debate-trained model that assigns probabilities to its answers; whether this gets higher accuracy is its own interesting question, and it’s a result we would hope for based on the findings we already have.
Furthermore, if the prior accuracies differ, the comparison gets more complicated. Since we’re interested in the information gain from the supervision method, models starting with different baseline accuracies would require correcting for that difference, and the results wouldn’t be as interpretable (we couldn’t directly compare the resulting judge accuracies). Plus, it might be harder for human judges to stay calibrated.
We’re interested in measuring the useful information the supervisor can get from the supervision method. If we already start with a lot of information (e.g., if the model’s stated answer is right 90% of the time), it becomes much harder to tell the supervision methods apart, because the gap in information is smaller: you can always do at least as well as just deferring to the AI. So for the purpose of comparing supervision paradigms most easily, starting from a 50⁄50 baseline (minimal starting information) makes it easier to spot an effect, if one exists.
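As a back-of-the-envelope illustration (my numbers, not from the experiments): the entropy of the judge’s prior bounds how much information a supervision method can still add, and it drops quickly as the prior moves away from 50⁄50.

```python
import math

def prior_entropy_bits(p):
    """Entropy of a binary prior, i.e. the headroom (in bits) a
    supervision method has left to add information."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(prior_entropy_bits(0.5))  # 1.0   bit of headroom at a 50/50 prior
print(prior_entropy_bits(0.9))  # ~0.47 bits left if the model is 90% accurate
```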
Since we haven’t gotten debate training to work yet, there’s no clear way to assign probabilities to the answers, so we can start with a prior of 50⁄50. Then we just need to do the same for consultancy.
So: overall, yes I think open consultancy is a good idea for a baseline, but I think the proper comparison would be to ‘open’ debate, and it seems to me that the CoT aspect is orthogonal. It’s possible I’ve missed something in your proposal though. Does this all sound reasonable to you?
Also, a side question we might have to deal with in practice in the ‘open’ case is ensuring judge attentiveness and preventing premature deferral to the model once it becomes reasonably trustworthy. I think this is what you were getting at near the end of your post. In practice this might just mean trying to focus supervision on the error cases, which could basically just function to bring the prior back towards 50⁄50 anyway.
I think this is probably where the ‘open’ variant comparison gets more interesting and realistic, because what we’re really interested in is the robustness of the resulting system. Even if debate works much better when starting from the 50⁄50 baseline, I can imagine a world where consultancy closes much of that gap over training, or where their error cases or answer distributions differ significantly, so that improving performance on the margin looks different between the two approaches in ways we wouldn’t see just by running experiments with 50⁄50 priors on pre-selected answer pairs. I think we’re inevitably going to run into all of this once we start doing RL training and letting the model pick the answers, which is where I see this research agenda going.
Thanks for the detailed answer!
I hadn’t really considered what happens when you use debate as a reward signal; I agree you get the benefits of open consultancy, thanks for pointing it out!
Using regular consultancy as a “debugging baseline” makes sense to me. I often read papers as trying to answer a question like “how can we use untrusted experts?” (where you should compare against the best alternative to debate that is robust to scheming), but they are sometimes about “how can we make debate work?” (where the debugging baseline is more informative).
About the search process over CoT: I agree this is often a good way to produce the justification for the correct side of the argument. It might be a terrible way to produce the incorrect side, so I think “generate CoTs until you get one that supports your side” is not a drop-in replacement for “generate a justification for your side”. So you often have to combine both the CoT and the post-hoc justification, and what you get is almost exactly what I call “using open consultancy as evidence in a debate”. I agree that this gets you most of the way there. But my intuition is that for some tasks, the CoT of the open consultant is so much stronger evidence than justifications optimized to convince you of a potentially wrong answer that it starts to get weird to call that “debate”.
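A sketch of that contrast (hypothetical interfaces again; `sample_cot`, `final_answer`, and `justify` are assumptions for illustration): resampling CoTs can simply fail when the assigned side is wrong, so you fall back to a post-hoc justification, which is exactly the combination described above.

```python
def cot_for_side(question, side, model, max_tries=50):
    # Resample chains of thought until one supports `side`.
    # Fine when `side` is correct; for an incorrect side this may
    # exhaust the budget without finding an honest CoT.
    for _ in range(max_tries):
        cot = model.sample_cot(question)
        if model.final_answer(question, cot) == side:
            return cot
    return None

def argument_for_side(question, side, model):
    cot = cot_for_side(question, side, model)
    if cot is not None:
        return cot  # a CoT that genuinely lands on `side`
    # Otherwise fall back to a post-hoc justification: it always returns
    # something, but it is optimized to convince, not to track reasoning.
    return model.justify(question, side)
```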