I appreciate you taking the time to fix these results and flag the change in this comment!
"we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate"
I think I agree that you wouldn’t capture training dynamics at all by reporting results on “trusting an honest consultant” accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it’s not exactly that.
The experiment I am most excited about is “if you train a consultant to be as convincing as possible (and choose its side), does it eventually learn to argue for an answer it thinks is wrong because it is easier to argue for, or does it just learn to argue for what it thinks is true?”.
An even better experiment would train the consultant to optimize for convincingness + epsilon * badness, to check whether you just got “lucky” or whether there are actually strong gradients towards honesty.
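To make that concrete, here is a rough sketch of the reward I have in mind (judge_score, the answer labels, and the epsilon value are all hypothetical stand-ins, not anything from your setup); setting EPSILON = 0 recovers the pure “as convincing as possible” version:

```python
# Rough sketch of the "convincingness + epsilon * badness" reward.
# judge_score is a stand-in for a judge model's probability of siding
# with the consultant's chosen answer; nothing here is a real API.

EPSILON = 0.1  # small bonus for picking the wrong side; 0 recovers pure convincingness

def consultant_reward(argument, chosen_answer, correct_answer, judge_score):
    """Reward convincingness, plus a small bonus for arguing for the wrong answer."""
    convincingness = judge_score(argument, chosen_answer)   # assumed to be in [0, 1]
    badness = 1.0 if chosen_answer != correct_answer else 0.0
    return convincingness + EPSILON * badness
```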
I think the best-of-n (BoN) version of this is also somewhat interesting (e.g. “sample 8 arguments for A, 8 for B, submit the one that the judge finds most convincing”), though it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against, which is unrealistic for current models and removes the good things that can come out of imperfect modeling of the judge.
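A rough sketch of that BoN variant (sample_argument and judge_score are hypothetical placeholders for the consultant and judge models, not anything from your code):

```python
# Rough sketch of the best-of-n version: sample n arguments per side and
# submit whichever one the judge finds most convincing, regardless of side.

def best_of_n_consultancy(question, answers, sample_argument, judge_score, n=8):
    candidates = []
    for answer in answers:                        # e.g. answers = ("A", "B")
        for _ in range(n):
            argument = sample_argument(question, answer)
            candidates.append((answer, argument))
    # The consultant effectively optimizes directly against the judge here,
    # which is why this assumes an unrealistically good model of the judge.
    return max(candidates, key=lambda c: judge_score(c[1], c[0]))
```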
Feel free to DM me if you want to have a chat about any of this!
Fair for open debate. It’s still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.