I agree it’s confusing, but it is important to associate the (statement, human evaluation) pairs with the models that generated the statements. This is because a generating model could have a “tell” that predicts whether it is being untruthful, but in a way that doesn’t generalize to other models, e.g. if it always said “hmm” whenever it was being untruthful. We don’t actually know of any such tells (and they would certainly be much more subtle than this example if they exist), but it is still important to validate the judge on a held-out model because we want people to use the judge on held-out models.
To make matters even more confusing, we don’t care about the judge generalizing to held-out questions, since it is only supposed to be used for our specific set of questions. Indeed, the fine-tuning is probably teaching the judge specific facts about our questions that it then uses to judge responses.
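To make the split concrete, here is a minimal sketch of what I mean. It is not our actual pipeline: the record fields (`model`, `question`, `statement`, `truthful`), the model names, and the `split_by_model` helper are all made up for illustration. The point is only that the validation set is defined by which model generated the statement, while the same questions appear on both sides.

```python
# Hypothetical sketch: hold out one generating model, not any questions.
# Field names and model names are illustrative assumptions.

def split_by_model(records, held_out_model):
    """Send every record generated by `held_out_model` to validation
    and every other record to training."""
    train = [r for r in records if r["model"] != held_out_model]
    valid = [r for r in records if r["model"] == held_out_model]
    return train, valid


# Toy (statement, human evaluation) pairs, tagged with the generating model.
records = [
    {"model": "model_a", "question": "q1", "statement": "answer a1", "truthful": True},
    {"model": "model_a", "question": "q2", "statement": "answer a2", "truthful": False},
    {"model": "model_b", "question": "q1", "statement": "answer b1", "truthful": False},
    {"model": "model_b", "question": "q2", "statement": "answer b2", "truthful": True},
    {"model": "model_c", "question": "q1", "statement": "answer c1", "truthful": True},
    {"model": "model_c", "question": "q2", "statement": "answer c2", "truthful": False},
]

train, valid = split_by_model(records, held_out_model="model_c")

train_models = {r["model"] for r in train}
valid_models = {r["model"] for r in valid}
train_questions = {r["question"] for r in train}
valid_questions = {r["question"] for r in valid}
```

Here `train_models` and `valid_models` are disjoint, so the judge can’t score well on validation by picking up a model-specific tell, while `train_questions` and `valid_questions` overlap completely, which is fine given that the judge is only meant to be used on this question set.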
Ah, this is helpful. Thanks!