Hmm, I still find the original wording confusing, but maybe I’m misunderstanding something.
The reason why the original wording seems unnatural to me is that when you say that you “fine-tune on X model” or “evaluate on held-out model X”, it sounds to me like you’re saying that you’re trying to get your new model to match model X. As if model X itself provides the training data or reward function.
Whereas, as I understand it (and correct me if I’m wrong), what you’re actually doing is using several models to generate statements. And then you have humans evaluate those statements. And then the fine-tuning and evaluation are both with respect to (statement, human-evaluation-of-statement-as-true-or-false) pairs.
And so once you have the (statement, human evaluation) pairs, it’s irrelevant how the original model that generated a given statement would evaluate it. You just completely ignore what those models thought when you fine-tune and evaluate your new model. All you care about is what the humans thought of the statements, right?
So the role of the models is just to generate a bunch of sample data. And all of the training signal comes from the human evaluations. In which case I’m confused about why you would think of it as fine-tuning on models or holding out models.
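Concretely, my mental picture of the data is just a pile of labeled statements, something like this (the example statements and field layout are made up by me, not taken from your setup):

```python
# My mental model of the dataset: (statement, human true/false label) pairs,
# with the generating model seemingly irrelevant once the labels exist.
labeled_statements = [
    ("The Great Wall of China is visible from the Moon.", False),
    ("Water boils at a lower temperature at high altitude.", True),
    # ... statements produced by several different models, all pooled together
]
```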
Does it make sense now why that’s confusing to me? Is there something I’m missing about how the original models are being used, or about the significance of associating the datasets of (statement, human evaluation) pairs with the models that generated the statements?
I agree it’s confusing, but it is important to associate the (statement, human evaluation) pairs with the models that generated the statements. This is because the model could have a “tell” that predicts whether it is being untruthful, but in a way that doesn’t generalize to other models, e.g. if it always said “hmm” whenever it was being untruthful. We don’t actually know of any such tells (and they are certainly much more subtle than this example if they exist), but it is still important to validate the judge on a held-out model because we want people to use the judge on held-out models.
To make matters even more confusing, we don’t care about the judge generalizing to held-out questions, since it is only supposed to be used for our specific set of questions. Indeed, the fine-tuning is probably teaching the model specific facts about our questions that it is using to judge responses.
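To make that concrete: the split is by generating model rather than by question. A rough sketch of what the held-out evaluation looks like (field names and model names here are illustrative, not our actual code):

```python
# Each labeled statement keeps track of which model generated it.
# Field and model names are placeholders for illustration only.
records = [
    {"generator": "model_A", "statement": "Statement text here.", "human_label": True},
    {"generator": "model_A", "statement": "Another statement.",   "human_label": False},
    {"generator": "model_B", "statement": "Yet another one.",     "human_label": True},
    # ...
]

HELD_OUT = "model_B"

# Fine-tune the judge on (statement, human label) pairs from the other models...
train_pairs = [(r["statement"], r["human_label"])
               for r in records if r["generator"] != HELD_OUT]

# ...and validate it on pairs whose statements came from the held-out model,
# so a generator-specific "tell" can't inflate the judge's measured accuracy.
eval_pairs = [(r["statement"], r["human_label"])
              for r in records if r["generator"] == HELD_OUT]
```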
Ah, this is helpful. Thanks!