Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:
Did the model memorize the answer such that it can be retrieved? A model may encounter the answer in training but still not memorize it (e.g. because it appears rarely in training).
Does the model know it doesn’t know the answer (so it can say “I don’t know”)? This is difficult because GPT-3 only learns to say “I don’t know” from human examples. It gets no direct feedback about its own state of knowledge. (This will change as more text online is generated by LMs).
Do prompts even exist that induce the behavior we want? Can we discover those prompts efficiently? (Noting that we want prompts that are not overfit to narrow tasks).
(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)
Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can we used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter).
Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:
Did the model memorize the answer such that it can be retrieved? A model may encounter the answer in training but still not memorize it (e.g. because it appears rarely in training).
Does the model know it doesn’t know the answer (so it can say “I don’t know”)? This is difficult because GPT-3 only learns to say “I don’t know” from human examples. It gets no direct feedback about its own state of knowledge. (This will change as more text online is generated by LMs).
Do prompts even exist that induce the behavior we want? Can we discover those prompts efficiently? (Noting that we want prompts that are not overfit to narrow tasks).
(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)
Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can we used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter).