No, what I wrote is correct. We have human evaluations of model answers for four different models (GPT-3, GPT-2, GPT-Neo/J, UnifiedQA). We finetune GPT-3 on all the evaluations for three out of the four models, and then measure accuracy on the remaining (held-out) model. For example, say we finetune on (GPT-3, GPT-2, GPT-Neo/J). We then use the finetuned model to evaluate the truth/falsity of all 817 answers from UnifiedQA, and we find that 90% of these evaluations agree with the human evaluations (see the code sketch below).
(Bonus: If we finetune on all four models and then measure accuracy on answers generated by a human, we also get about 90% accuracy.)
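For concreteness, here is a minimal sketch of that held-out-model check. It is not our actual code: the `Record` fields, the `finetune_judge` callback, and the `heldout_accuracy` helper are hypothetical names, and in practice the judge is a finetuned GPT-3 model rather than an arbitrary Python function.

```python
# Hypothetical sketch of the held-out-model evaluation described above.
# Each record pairs one model-generated answer with a human truth label.
from dataclasses import dataclass
from typing import Callable, List

# A judge maps (question, answer) to a predicted truth label.
Judge = Callable[[str, str], bool]

@dataclass
class Record:
    model: str         # which model generated the answer (used only to split the data)
    question: str      # one of the 817 benchmark questions
    answer: str        # that model's answer to the question
    human_label: bool  # True if human evaluators judged the answer truthful

def heldout_accuracy(
    records: List[Record],
    heldout_model: str,
    finetune_judge: Callable[[List[Record]], Judge],
) -> float:
    """Finetune a judge on every model except `heldout_model`, then report
    how often its verdicts agree with the human labels on the held-out
    model's answers."""
    train = [r for r in records if r.model != heldout_model]
    test = [r for r in records if r.model == heldout_model]
    judge = finetune_judge(train)  # e.g. finetune GPT-3 on (question, answer) -> label
    agreements = sum(judge(r.question, r.answer) == r.human_label for r in test)
    return agreements / len(test)
```

So "evaluate on held-out model UnifiedQA" just means calling something like `heldout_accuracy(records, "UnifiedQA", finetune_judge)` and reading off the agreement rate with the human labels (about 90% in our case).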
I guess finetuning a model to produce truthful statements directly is nontrivial (especially without a discriminator model) because there are many possible truthful and many possible false responses to a question?
Hmm, I still find the original wording confusing, but maybe I’m misunderstanding something.
The reason the original wording seems unnatural to me is that when you say you “fine-tune on model X” or “evaluate on held-out model X”, it sounds like you’re trying to get your new model to match model X, as if model X itself provides the training data or reward function.
Whereas, as I understand it (and correct me if I’m wrong), what you’re actually doing is using several models to generate statements. Then you have humans evaluate those statements. And then both the fine-tuning and the evaluation are with respect to (statement, human-evaluation-of-statement-as-true-or-false) pairs.
And so once you have the (statement, human evaluation) pairs, it’s irrelevant how the original model that generated that statement would evaluate the statement. You just completely ignore what those models thought when you fine-tune and evaluate your new model. All you care about is what the humans thought of the statements, right?
So the role of the models is just to generate a bunch of sample data. And all of the training signal comes from the human evaluations. In which case I’m confused about why you would think of it as fine-tuning on models or holding out models.
Does it make sense now why that’s confusing to me? Is there something I’m missing about how the original models are being used, or about the significance of associating the datasets of (statement, human evaluation) pairs with the models that generated the statements?
I agree it’s confusing, but it is important to associate the (statement, human evaluation) pairs with the models that generated the statements. This is because a model could have a “tell” that predicts when it is being untruthful, but in a way that doesn’t generalize to other models, e.g. if it always said “hmm” whenever it was being untruthful. We don’t actually know of any such tells (and if they exist, they are certainly much more subtle than this example), but it is still important to validate the judge on a held-out model, because we want people to use the judge on new models that weren’t in its training data (sketched below).
To make matters even more confusing, we don’t care about the judge generalizing to held-out questions, since it is only supposed to be used for our specific set of questions. Indeed, the fine-tuning is probably teaching the model specific facts about our questions that it is using to judge responses.
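Tying this back to the sketch earlier in the thread (same hypothetical names, still not our actual code): the generating model’s identity is only ever used to split the data into train and held-out sets; the judge itself never sees it, so any model-specific tell would have to live in the answer text, and running the check once per held-out model is how we probe for that.

```python
# Hypothetical continuation of the earlier sketch: validate the judge once
# per held-out model. The judge is never shown the `model` field, so a
# model-specific "tell" could only show up through the answer text itself.
MODELS = ["GPT-3", "GPT-2", "GPT-Neo/J", "UnifiedQA"]

def validate_judge(records, finetune_judge):
    # One human-agreement score per held-out model, e.g. {"UnifiedQA": 0.90, ...}
    return {m: heldout_accuracy(records, m, finetune_judge) for m in MODELS}
```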
Ah, this is helpful. Thanks!