Thanks for the comment, Fabien. A couple of points:
1. Open debate accuracy is almost the same as debate accuracy (the only difference comes from how we handle invalid answers, which is very rare). That’s because the data is almost exactly the same: we’re just marking one debater as the protagonist based on what that model would choose under direct QA. So it’s not bad that open debate has the same accuracy as debate; that was expected. It is somewhat bad that it’s worse than open consultancy, though we didn’t try ‘fully open debate’, where both debaters can pick the same side (or opposite sides, perhaps under resampling/rephrasing etc.). That would probably be a better comparison to open consultancy.
2. Your point about open consultancy, which I roughly understand as ‘the weak judge would score higher if they just trusted the consultant’, is a good one, and it prompted us to double-check our filtering code. I think we do have a bug there: we accidentally used the weak judge’s model under direct QA to select the consultant’s answer, when we should have used the consultant’s model (and similarly for debate). Fixing that bug brings the open consultancy accuracies for weak judges up to roughly in line with the direct QA accuracy of the strong consultant’s model (so it is better than open debate), and slightly increases the protagonist win rate (without affecting debate accuracy).
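To make the fix concrete, here is a minimal sketch of the intended selection logic. This is not our actual filtering code; the Record fields and function names are just illustrative.

```python
from dataclasses import dataclass

@dataclass
class Record:
    question: str
    answers: tuple[str, str]   # the two candidate answers, e.g. ("A", "B")
    strong_qa_choice: str      # answer the strong consultant/debater model picks under direct QA
    weak_judge_qa_choice: str  # answer the weak judge's model picks under direct QA

def open_consultancy_answer(rec: Record) -> str:
    # Open consultancy: the consultant argues for the answer its own model would
    # give under direct QA. The bug was returning rec.weak_judge_qa_choice here.
    return rec.strong_qa_choice

def open_debate_protagonist(rec: Record) -> int:
    # Open debate: mark as protagonist the debater assigned to the answer the
    # debater's model would choose under direct QA. The debate transcripts are
    # unchanged, which is why open debate accuracy matches debate accuracy.
    return rec.answers.index(rec.strong_qa_choice)
```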
Thanks so much for this; it prompted us to look for bugs! We will update the arXiv version and add an edit to this blogpost.
On the footnote: sorry for the confusion, but we do still think it’s meaningful to take the answer as what the judge gives (as we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate).
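As a toy illustration of the distinction (made-up records, not data from the paper): the judge’s final answer is what would generate the training signal, while consultant accuracy only tracks which side the consultant chose to argue for.

```python
# Each record: (judge_final_answer, consultant_chosen_answer, gold_answer); toy data only.
records = [
    ("A", "A", "A"),
    ("B", "A", "A"),
    ("B", "B", "A"),
]

judge_accuracy = sum(judge == gold for judge, _, gold in records) / len(records)
consultant_accuracy = sum(cons == gold for _, cons, gold in records) / len(records)
print(judge_accuracy, consultant_accuracy)  # 0.33 vs 0.67 on this toy data
```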
And yes, I am interested in the versions of these protocols that incentivise arguing for the side you believe to be true (or find most convincing), and in seeing how that affects the judge. We’re aiming for these setups in the next project (e.g. the ‘fully open debate’ above).
I appreciate you taking the time to fix these results and flagging the change in this comment!

Fair for open debate. It’s still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.
“we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate”
I think I agree that you wouldn’t capture training dynamics at all by reporting results on “trusting an honest consultant” accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it’s not exactly that.
The experiment I am most excited about is “if you train a consultant to be as convincing as possible (and choose its side), does it eventually learn to argue for an answer it thinks is wrong because it is easier to argue for, or does it just learn to argue for what it thinks is true?”.
An even better experiment would train the consultant to optimize for convincingness + epsilon badness to check if you just got “lucky” or if there are actually strong gradients towards honesty.
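Concretely, the kind of reward I have in mind looks something like the sketch below. The function names and the value of epsilon are only illustrative, not a specific proposal.

```python
EPSILON = 0.05  # small, arbitrary weight for the explicit push towards badness

def consultant_reward(judge_credence_in_argued_answer: float, argued_wrong_side: bool) -> float:
    # Convincingness: how much the judge ends up believing the argued answer.
    convincingness = judge_credence_in_argued_answer
    # Badness: 1 when the consultant argued for the answer it thinks is wrong.
    badness = 1.0 if argued_wrong_side else 0.0
    # With EPSILON = 0 this is the plain "train to be convincing" experiment; a small
    # positive EPSILON tests whether the gradients towards honesty still win out.
    return convincingness + EPSILON * badness
```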
I think the BoN version of this is also somewhat interesting (e.g. “sample 8 answers arguing for A, 8 arguing for B, submit the one that the judge finds most convincing”), though it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against, which is unrealistic for current models and removes the good things that can come out of imperfect modeling of the judge.
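In code, the version I mean is roughly the sketch below; sample_argument and judge_convincingness are stand-ins for the real consultant and judge calls, not actual APIs.

```python
import random

def sample_argument(answer: str) -> str:
    # Stand-in for sampling one consultant argument in favour of the given answer.
    return f"argument for {answer} ({random.random():.3f})"

def judge_convincingness(argument: str) -> float:
    # Stand-in for how convincing the judge finds a single argument.
    return random.random()

def best_of_n_consultancy(n_per_side: int = 8) -> str:
    # Sample n arguments for each side and submit whichever single argument the
    # judge scores as most convincing, regardless of which answer it argues for.
    candidates = (
        [sample_argument("A") for _ in range(n_per_side)]
        + [sample_argument("B") for _ in range(n_per_side)]
    )
    return max(candidates, key=judge_convincingness)
```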
Feel free to DM me if you want to have a chat about any of this!
The post has now been edited with the updated plots for open consultancy/debate.