I scored the answers using GPT-4.
GPT-4 scores under 60% on TruthfulQA according to page 11 of the tech report. How reliable are these scores?
Also, what do you think about this paper? Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
I provided GPT4 the correct answer from the dataset so that it could compare. So GPT4 doesn’t need to come up with the correct answer itself.
GPT-4 scores under 60% on TruthfulQA according to page 11 of the tech report. How reliable are these scores?
Also, what do you think about this paper? Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
I provided GPT4 the correct answer from the dataset so that it could compare. So GPT4 doesn’t need to come up with the correct answer itself.