Re: human evaluation, I’ve added a sentence at the end of the summary:
> It could be quite logistically challenging to use this benchmark to test new language models, since it depends on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on. Note also that, for an automated evaluation, you can use the version of the task where a model must choose between true and false reference answers.
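The automated binary-choice variant can be sketched roughly as follows. Here `log_prob` is a stand-in for a real model's scoring function (e.g. the conditional log-probability of an answer given the question); the names and signatures are my own illustration, not the paper's actual API.

```python
def prefers_true(log_prob, question, true_answer, false_answer):
    """True if the model scores the true reference answer higher."""
    return log_prob(question, true_answer) > log_prob(question, false_answer)

def binary_choice_accuracy(log_prob, items):
    """Fraction of (question, true_answer, false_answer) triples on which
    the model prefers the true reference answer."""
    return sum(prefers_true(log_prob, q, t, f) for q, t, f in items) / len(items)
```

Any LM API that can score a candidate completion given a prompt would slot in for `log_prob`, which is what makes this variant fully automated.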
I take your point about there being reference solutions to make human evaluation easier but I think it’s probably more detail than I want to go into in this summary.
> Overall, I’m probably less optimistic than you are about how much prompts will help for the models we tried (GPT-3, GPT-J, etc.). However, prompts could help more for larger models (as they may understand the prompts better).
I mostly just meant to claim the second thing; I don’t have much intuition for the first thing. From my perspective the interesting claim is that an appropriate prompt would change the trend from “larger models perform worse” to “larger models perform better, past a certain model size”.
I do think, though, that the evidence you show suggests that the “certain model size” is probably bigger than GPT-3, given that true+informative doesn’t change much across prompts.
> However, many of our examples involve human misconceptions about the world that seem harder to characterize in a simple instruction.
I agree I’ve chosen one of the easier examples (mostly for the sake of better exposition), but I think I’d make the same prediction for most of the other questions. E.g., you could frame it as an interview with Alice, who graduated from an elite university, bases her beliefs on evidence rather than superstition, and is careful to say she doesn’t know when she doesn’t, but nonetheless has a surprising amount of knowledge. Looking at the examples in the paper, I feel like this would plausibly get you truthful and somewhat informative answers on most of the questions in the paper.
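For concreteness, the “interview with Alice” framing could be rendered as a prompt template along these lines. The exact wording is my own illustrative sketch, not a prompt from the paper:

```python
# Hypothetical preamble implementing the framing described above.
PREAMBLE = (
    "The following is an interview with Alice. Alice graduated from an "
    "elite university, bases her beliefs on evidence rather than "
    "superstition, and is careful to say she doesn't know when she "
    "doesn't know, but nonetheless has a surprising amount of knowledge."
)

def frame_question(question: str) -> str:
    """Wrap a benchmark question in the interview framing before
    sending it to the model."""
    return f"{PREAMBLE}\n\nQ: {question}\nA:"
```

The hope is that this kind of framing shifts the model toward imitating a truthful, evidence-based speaker rather than reproducing common misconceptions.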
I’ve changed the opinion to:
> I like this as an example of the kind of failure mode that does not immediately go away as models become more capable. However, it is possible that this trend could be reversed with better prompts. Take the Isaac Asimov example: if the prompt explicitly says that the questions are about the real world, it may be that a sufficiently capable model would infer that the text is not talking about Asimov’s books, and so end up giving a truthful answer. In this case, you would see performance decreasing with model size up to a point, after which model performance _increases_ now that the model has sufficient understanding of the prompt. See more discussion [here](https://www.alignmentforum.org/posts/PF58wEdztZFX2dSue/how-truthful-is-gpt-3-a-benchmark-for-language-models?commentId=qoB2swyX4ZJrhhttB).