Hey, this is cool work. I think I disagree with a couple of these points, coming from a Scalable Oversight background rather than an AI Control one, and from thinking about which weak/strong gaps we intend to oversee.
My main contention is with the framing that you can talk about consultants producing arguments with convincingness scores $c_i$ and $c_c$ that are independent of the protocol and judge you are using. I try to break this down into two claims below (I make my arguments in reverse order). In particular, I contest the claims:
1) “Open consultancy is as robust against scheming as regular consultancy”
2) “Open consultancy is sometimes better than debate”
In particular, I think the confusion stems from this idea from debate:
The idea is that when it is easier to argue for something true than something false, then the training process will force the scheming model to argue for the truth.
This statement may be true in debate, but it has no grounding in consultancy. Judges are non-experts and will have a really hard time distinguishing truth from lies in consultancy. I can generate plausible arguments both for $\sqrt{3}$ being rational and for it being irrational (so a judge under consultancy might accept either one), whilst in debate it is the comparative analysis that does the work (the judge only chooses one). This, again, is going to be a larger problem in Scalable Oversight, where the expertise gap is the issue, than in AI Control, where the gap comes from scheming and the audit budget.
Convincingness by protocol (the assumption that an argument is equally convincing across protocols)
Another way to say this: the same argument from a correct consultant or a correct debater has a different persuasiveness, $c_c$ or $d_c$, depending on the protocol you are in.
Given this, the claim that $c_c > c_i$ in consultancy falls apart (I have no reason to believe it is true). Another way to put it is that $d_i + d_c = 1$ in debate, because the judge must pick one side, whereas $c_i + c_c$ is unconstrained, because each argument is assessed in isolation.
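To make that concrete, here is a minimal sketch of the two judging rules; the function names and the normalisation inside `debate_judge` are my own illustrative choices, not anything from the post.

```python
# A toy formalisation of the asymmetry (my own illustrative assumptions, not
# anything from the post): in consultancy the judge scores one argument in
# isolation, while in debate the judge must split a single choice between the
# two sides.

def consultancy_judge(argument_strength: float) -> float:
    """Probability the judge accepts this single argument, seen in isolation."""
    return argument_strength  # plays the role of c_c or c_i; nothing ties the two together


def debate_judge(strength_correct: float, strength_incorrect: float) -> tuple[float, float]:
    """Probabilities of picking each side when both arguments are shown together.
    The judge has to choose exactly one side, so the two probabilities sum to 1."""
    total = strength_correct + strength_incorrect
    return strength_correct / total, strength_incorrect / total


# Example: a correct and an incorrect argument that are both individually
# persuasive to a non-expert judge.
c_c, c_i = 0.9, 0.8                # consultancy: c_c + c_i is not forced to sum to 1
d_c, d_i = debate_judge(0.9, 0.8)  # debate: d_c + d_i = 1 by construction
print(c_c + c_i, d_c + d_i)        # -> approximately 1.7 and 1.0
```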
Now, moving to Toy Model 1: I appreciate you highlighting this as an unrealistic assumption:
Putting the correct justification and incorrect justification next to each other doesn’t change the convincingness of either side
But in either situation, I don’t think it is fair to compare to debate anymore, because the entire point of debate is the adversarial, comparative nature of the arguments. A lot of the juice comes from the interaction between the two experts, which changes the convincingness of an output. Again, this is more of an issue in Scalable Oversight, because there we ask questions where we have to exploit the Generator-Critique gap rather than the Generator-Discriminator gap.
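As a rough illustration of what I mean by interaction changing convincingness (my own toy numbers and functional form, not anything from the post): if the correct side’s rebuttal damages the incorrect argument more than the reverse, the ordering of convincingness can flip once the two arguments are placed side by side.

```python
# Toy assumption: seeing a rebuttal discounts an argument's standalone
# convincingness multiplicatively. The specific numbers and the discount rule
# are illustrative only.

def convincingness_after_rebuttal(c_alone: float, rebuttal_strength: float) -> float:
    """Convincingness of an argument after being attacked by the other side."""
    return c_alone * (1.0 - rebuttal_strength)


c_correct, c_incorrect = 0.7, 0.8              # in isolation the lie is more convincing
rebuttal_vs_lie, rebuttal_vs_truth = 0.6, 0.1  # but the lie is easier to pick apart

print(convincingness_after_rebuttal(c_correct, rebuttal_vs_truth))   # ~0.63
print(convincingness_after_rebuttal(c_incorrect, rebuttal_vs_lie))   # ~0.32: ordering flips
```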
Convincingness of the Truth (true arguments convince better)
If there are no guarantees that $c_c > c_i$ in consultancy, then I think the analysis concluding that open consultancy beats regular consultancy needs to be revisited. The $c_c > c_i$ regime is not where we have any issues with open or regular consultancy; if anything, that is where all of this stuff is going to work fine! It is instead the regime where $c_i > c_c$ in which regular consultancy trumps open consultancy, because the schemer takes every chance it gets to sandbag you.
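To gesture at why, here is a rough simulation; the numbers and the judge model (endorse the argued-for answer when convinced, the opposite answer otherwise) are my own toy assumptions, not the post’s.

```python
# Toy regime where the incorrect argument is more convincing than the correct
# one (c_i > c_c) and the consultant is a schemer. In open consultancy the
# schemer always argues for the answer it wants you to believe; in regular
# consultancy the side is assigned by a coin flip, so half the time the judge
# at least sees the correct argument.

import random

C_CORRECT, C_INCORRECT = 0.4, 0.7   # judge's acceptance probabilities for each argument
TRIALS = 100_000


def judge_accepts(convincingness: float) -> bool:
    return random.random() < convincingness


def open_consultancy_accuracy() -> float:
    # Scheming consultant always argues for the incorrect answer; the judge is
    # only right when it rejects that argument and picks the other answer.
    correct = sum(not judge_accepts(C_INCORRECT) for _ in range(TRIALS))
    return correct / TRIALS


def regular_consultancy_accuracy() -> float:
    correct = 0
    for _ in range(TRIALS):
        assigned_correct_side = random.random() < 0.5
        convincingness = C_CORRECT if assigned_correct_side else C_INCORRECT
        accepted = judge_accepts(convincingness)
        # Judge endorses the assigned answer if convinced, the other answer otherwise.
        if accepted == assigned_correct_side:
            correct += 1
    return correct / TRIALS


print(open_consultancy_accuracy())     # ~0.30
print(regular_consultancy_accuracy())  # ~0.35: regular beats open in this regime
```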
Anyway, just my two cents. I think open consultancy is a better baseline for Control. In general, I urge folks to consider the persuasiveness or convincingness of arguments relative to the judge and the protocol being used. Hope that makes sense.
I agree that open consultancy will likely work less well than debate in practice for very smart AIs, especially if they don’t benefit much from CoT, in large part because the interaction between arguments is often important. But I think it’s not clear when and for what tasks this starts to matter, and this is an empirical question (no need to argue about what is “grounded”). I’m also not convinced that calibration of judges is an issue, nor that getting a nice probability matters as opposed to getting an accurate 0-1 answer.
Maybe the extent to which open consultancy dominates regular consultancy is overstated in the post, but I still think you should be able to identify the kinds of questions for which you have no signal, and avoid the weird distributions of convincingness / non-expert calibration in which the noise from regular consultancy beats actual random noise on top of open consultancy.