Really interesting! I especially like the way you describe imitative falsehood. I think this is way better than ascribing them to inaccuracy in the model. And larger models being less truthful (although I would interpret that slightly differently, see below) is a great experimental result!
I want to propose an alternative interpretation that slightly changes the tone and the connections to alignment. The claim is that large LMs don’t really act like agents, but far more like simulators of processes (which might include agents). According to this perspective, a LM doesn’t search for the best possible answer to a question, but just interpret the prompt as some sort of code/instruction on which process to simulate. So for example buggy code would prompt a simulation of a buggy code generating process. This view has been mostly developed by some people from EleutherAI, and proves IMO a far better mechanistic explanation of LM behavior than an agenty model.
If we accepts this framing, this has to big implication for what you write about:
First, the decrease in truthfulness for larger models can be interpreted as getting better at running more simulations in more detail. Each prompt would entail slightly different continuation (and many more potential continuations), which would result in a decrease in coherence. By that I mean that variants of a prompt that would entail the same answer for humans will have more and more varied continuations, instead of the more uniform and coherent answer that we would expect of an agent getting smarter. (We ran a little very adhoc experiment on that topic with a member of EleutherAI if you’re interested).
Second, a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt! It might be tricky or harder to get a prompt that does exactly what we want (hence issues of competitiveness) but we have this strong property of corrigibility coming from optimization of simulating many different processes, and not optimization of a specific small and concrete goal. Why I think this relates to your post is that the tone of your “Connection to alignment” section strikes me as saying: “we should remove imitative falsehood as much as we can, because they’re fundamentally a misalignment”. And I want to push back a little, by pointing out that from a certain angle, imitative falsehood might be evidence of a very valuable form of corrigibility by default.
Related to the last point, calling imitative falsehood dishonesty or hiding information by the LM doesn’t make sense in this framing: you don’t accuse your compiler of being dishonest when it doesn’t correct the bugs in your code, even if with correct code it could definitely generate the executable you wanted.
Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:
Did the model memorize the answer such that it can be retrieved? A model may encounter the answer in training but still not memorize it (e.g. because it appears rarely in training).
Does the model know it doesn’t know the answer (so it can say “I don’t know”)? This is difficult because GPT-3 only learns to say “I don’t know” from human examples. It gets no direct feedback about its own state of knowledge. (This will change as more text online is generated by LMs).
Do prompts even exist that induce the behavior we want? Can we discover those prompts efficiently? (Noting that we want prompts that are not overfit to narrow tasks).
(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)
Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can we used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter).
Really interesting! I especially like the way you describe imitative falsehood. I think this is way better than ascribing them to inaccuracy in the model. And larger models being less truthful (although I would interpret that slightly differently, see below) is a great experimental result!
I want to propose an alternative interpretation that slightly changes the tone and the connections to alignment. The claim is that large LMs don’t really act like agents, but far more like simulators of processes (which might include agents). According to this perspective, a LM doesn’t search for the best possible answer to a question, but just interpret the prompt as some sort of code/instruction on which process to simulate. So for example buggy code would prompt a simulation of a buggy code generating process. This view has been mostly developed by some people from EleutherAI, and proves IMO a far better mechanistic explanation of LM behavior than an agenty model.
If we accepts this framing, this has to big implication for what you write about:
First, the decrease in truthfulness for larger models can be interpreted as getting better at running more simulations in more detail. Each prompt would entail slightly different continuation (and many more potential continuations), which would result in a decrease in coherence. By that I mean that variants of a prompt that would entail the same answer for humans will have more and more varied continuations, instead of the more uniform and coherent answer that we would expect of an agent getting smarter. (We ran a little very adhoc experiment on that topic with a member of EleutherAI if you’re interested).
Second, a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt! It might be tricky or harder to get a prompt that does exactly what we want (hence issues of competitiveness) but we have this strong property of corrigibility coming from optimization of simulating many different processes, and not optimization of a specific small and concrete goal.
Why I think this relates to your post is that the tone of your “Connection to alignment” section strikes me as saying: “we should remove imitative falsehood as much as we can, because they’re fundamentally a misalignment”. And I want to push back a little, by pointing out that from a certain angle, imitative falsehood might be evidence of a very valuable form of corrigibility by default.
Related to the last point, calling imitative falsehood dishonesty or hiding information by the LM doesn’t make sense in this framing: you don’t accuse your compiler of being dishonest when it doesn’t correct the bugs in your code, even if with correct code it could definitely generate the executable you wanted.
Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:
Did the model memorize the answer such that it can be retrieved? A model may encounter the answer in training but still not memorize it (e.g. because it appears rarely in training).
Does the model know it doesn’t know the answer (so it can say “I don’t know”)? This is difficult because GPT-3 only learns to say “I don’t know” from human examples. It gets no direct feedback about its own state of knowledge. (This will change as more text online is generated by LMs).
Do prompts even exist that induce the behavior we want? Can we discover those prompts efficiently? (Noting that we want prompts that are not overfit to narrow tasks).
(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)
Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can we used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter).