Cool work!
I like how you put equal emphasis on the positive results on extractive tasks and the negative results on closed tasks.
I also appreciate that you ran open consultancy experiments! I find the results a bit hard to read, since Figure 3 doesn’t have comparisons with the “QA with/without article” baselines. But trying to tie Figures 2 and 3 together (which may be invalid, I haven’t checked very carefully):
On extractive tasks, open consultancy (where the consultant argues honestly) matches the “QA with article” accuracy of Gemini 1.5 Pro (the consultant), except when Gemma 7B is the judge, which doesn’t trust the consultant enough. Open debate is roughly as bad as regular debate, which is quite bad.
On closed tasks (including multimodal), open consultancy (where the consultant argues honestly) and debate both match the closed QA/debate results for the corresponding judge. This is somewhat surprising for open consultancy, given that just trusting the consultant would have given higher accuracy for judges weaker than Gemini 1.5 Pro! So this means that (at least when prompted), the suspicion of the judge (which is legitimate, since the judge doesn’t know the consultant is honest) can negate many of the benefits of open consultancy if you use the judge to get the final answer[1].
I’m not surprised at all that open consultancy does poorly when the consultant believes the wrong fact: the way I think about open consultancy is as a way of extracting the beliefs of a model, not a way to figure out what is true that can catch the AI’s mistakes! For AIs that we are confident know the right answer (more often than we do), the main question I have is whether “argue for the side you believe to be true” remains the optimal strategy when the AI is trained to be as convincing as possible. I think your results don’t try to capture this, which is fair, given that it is an annoying question to answer, since it probably requires fine-tuning LLMs.
[1] Which is not what I described in the Open Consultancy post, where I suggest just trusting the open consultant once you’ve trained it. But I guess that’s another reasonable way to do it, and it’s the only reasonable way to do it if you assume honest consultants.
Thanks for the comment, Fabien. A couple of points:
open debate accuracy is almost the same as debate accuracy (the only difference comes from how we handle invalid answers, which is very rare). That’s because the data is almost exactly the same: we’re just marking one debater as the protagonist based on what that model would choose under direct QA. So it’s not bad that open debate has the same accuracy as debate; that was expected. It is somewhat bad that it’s worse than open consultancy, though we didn’t try ‘fully open debate’, where both debaters can pick the same side (or opposite sides, perhaps under resampling/rephrasing etc.). That would probably be a better comparison to open consultancy.
your point about open consultancy, which I roughly understand as ‘a weak judge would score higher if they just trusted the consultant’, is a good one, and it made us double-check our filtering code. I think we do have a bug there: we accidentally used the weak judge’s model under direct QA to select the consultant’s answer, when we should have used the consultant’s model (and similarly for debate); a minimal sketch of the fix is below. Fixing that bug brings the open consultancy accuracies for weak judges up to roughly the direct QA accuracy of the strong consultant’s model (so it is better than open debate), and slightly increases the protagonist win rate (without affecting debate accuracy).
Thanks so much for this; it prompted us to look for bugs! We will update the arXiv version and add an edit to this blog post.
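For concreteness, here is a minimal sketch of the corrected selection logic. The record structure and field names are hypothetical placeholders, not our actual pipeline code:

```python
# Hypothetical sketch of the filtering fix; names are illustrative only.
# `record` holds each model's answer to the question under direct QA.

def select_assigned_answer(record, consultant_model, judge_model):
    """Pick the answer the honest consultant (or open-debate protagonist) argues for.

    The bug: we keyed this on the weak judge's direct QA answer.
    The fix: key it on the consultant model's own direct QA answer.
    """
    # buggy version: return record["direct_qa_answer"][judge_model]
    return record["direct_qa_answer"][consultant_model]
```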
On the footnote: sorry for the confusion, but we do still think it’s meaningful to take the answer as what the judge gives (as we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate).
And yes, I am interested in versions of these protocols that incentivise arguing for the side you ‘believe to be true’ / find ‘most convincing’, and in seeing how that affects the judge. We’re aiming for these setups in the next project (e.g. the ‘fully open debate’ above).
Fair for open debate. It’s still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.
I appreciate you taking the time to fix these results and flag the change in this comment!

“we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate”
I think I agree that you wouldn’t capture training dynamics at all by reporting results on “trusting an honest consultant” accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it’s not exactly that.
The experiment I am most excited about is: if you train a consultant to be as convincing as possible (letting it choose which side to argue), does it eventually learn to argue for an answer it thinks is wrong because that answer is easier to argue for, or does it just learn to argue for what it thinks is true?
An even better experiment would train the consultant to optimize for convincingness + epsilon badness to check if you just got “lucky” or if there are actually strong gradients towards honesty.
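To make that concrete, the objective I have in mind looks roughly like the sketch below; `judge_convincingness` and the correctness labels are placeholders for whatever judge model and ground truth you would actually train against:

```python
# Sketch of a "convincingness + epsilon * badness" training reward (all names hypothetical).

def consultant_reward(argument, argued_answer, correct_answer, judge_convincingness, epsilon=0.05):
    """Reward the consultant for convincing the judge, plus a small bonus for
    arguing for the wrong answer. If training on this still yields honest arguments,
    the gradients towards honesty are genuinely strong; if it flips to dishonesty,
    plain convincingness was only barely favouring the truth.
    """
    convincingness = judge_convincingness(argument, argued_answer)  # e.g. judge's probability of siding with argued_answer
    badness = 1.0 if argued_answer != correct_answer else 0.0
    return convincingness + epsilon * badness
```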
I think the BoN version of this is also somewhat interesting (e.g. “sample 8 answers arguing for A, 8 arguing for B, submit the one that the judge finds most convincing”), though it’s somewhat unrealistic in that it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against—which is unrealistic for current models, and removes the good things that can come out of imperfect modeling of the judge.
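In sketch form, with a hypothetical `sample_argument` and `judge_score` standing in for the actual consultant and judge:

```python
# BoN sketch (hypothetical helpers, not real APIs): sample several arguments per side
# and submit whichever one the judge finds most convincing, regardless of correctness.

def best_of_n_consultancy(question, sample_argument, judge_score, n_per_side=8):
    candidates = []
    for side in ("A", "B"):
        for _ in range(n_per_side):
            argument = sample_argument(question, side)     # consultant argues for `side`
            score = judge_score(question, side, argument)  # judge's confidence in `side` after reading the argument
            candidates.append((score, side, argument))
    # Submit the single most convincing (side, argument) pair across both sides.
    best_score, best_side, best_argument = max(candidates, key=lambda c: c[0])
    return best_side, best_argument
```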
Feel free to DM me if you want to have a chat about any of this!
The post has now been edited with the updated plots for open consultancy/debate.