LLMs necessarily have to simplify complex topics. The output for a prompt cannot represent everything they know about a fact or task. Even if the output is honest and helpful (ignoring harmless for now), the simplification will necessarily obscure some details of what the LLM "intends" to do, in the sense of how it satisfies the user request. The model is trained to get things done; thus, the way it simplifies leaves it a large degree of freedom and gives it many ways to achieve its goals.
You could think of a caring parent who tells the child a simplified version of the truth, knowing that the child will later ask additional questions and then learn the details (I have in mind a parent who is not hiding things intentionally). Nonetheless, the parent's expectations about what the child may or may not need to know (the parent's best model of society and the world, which may be subtly off) influence how they simplify for the child's benefit.
This is a form of deception. The deception may be benevolent, as in the example with the parent, but we can't know. Even if there is a chain of thought we can inspect, the same is true of the chain of thought itself: it, too, is a simplification. It seems unavoidable.
It seems to be only “deception” if the parent tries to conceal the fact that he or she is simplifying things.
As we use the term, yes. But the point (and I should have made that clearer) is that any mismodeling by the parent of the child's interests and future environment will not be visible to the child, or even to someone reading the thoughts of the well-meaning parent. Many parents want the best for their child but model the child's future wrongly (mostly through status quo bias; the problem is different for AI).
Isn’t the same true for pretty much every conversation that people have about non-trivial topics? It’s almost always true that a person cannot represent everything they know about a topic, so they have to simplify and have lots of degrees of freedom in doing that.
Yes! That's the right intuition. LLMs are doing the same, but we don't know their world model, and thus the direction of the simplification can be arbitrarily off.
Drilling down on the simplifications, as Villiam suggested, might help.
This could be addressed by a user interface that not only passes the user's prompt to the LLM but also provides additional instructions and automatically asks additional questions. The answers to those additional questions could be displayed in a smaller font as a side note, or maybe as graphical icons. One such question would be "In this answer, did you simplify things? If yes, tell me a few extra things I could pay attention to in order to get a better understanding of the topic," or something like that.
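To make the proposal concrete, here is a minimal sketch of such a wrapper in Python. The `ask_llm` function is a hypothetical placeholder for whatever chat-completion API the interface actually uses (an assumption, not a real library call); the follow-up question is the one suggested above, and the main answer and the side note are returned together so the UI can render the note in a smaller font or behind an icon.

```python
# Minimal sketch of the proposed UI wrapper: send the user's prompt,
# then automatically ask a follow-up question about simplifications,
# and return both answers so the interface can show the second one
# as a smaller side note.
# NOTE: `ask_llm` is a hypothetical placeholder for the real LLM API call.

from dataclasses import dataclass

FOLLOW_UP = (
    "In your previous answer, did you simplify things? "
    "If yes, tell me a few extra things I could pay attention to "
    "in order to get a better understanding of the topic."
)


@dataclass
class AnnotatedAnswer:
    answer: str      # the main reply, shown normally
    side_note: str   # shown in a smaller font or behind an icon


def ask_llm(messages: list[dict]) -> str:
    """Placeholder for the actual chat-completion call."""
    raise NotImplementedError


def answer_with_side_note(user_prompt: str) -> AnnotatedAnswer:
    messages = [{"role": "user", "content": user_prompt}]
    main_answer = ask_llm(messages)

    # Automatically ask the extra question in the same conversation.
    messages += [
        {"role": "assistant", "content": main_answer},
        {"role": "user", "content": FOLLOW_UP},
    ]
    side_note = ask_llm(messages)

    return AnnotatedAnswer(answer=main_answer, side_note=side_note)
```

A real interface would swap `ask_llm` for its actual model call and decide how to render the side note (smaller font, expandable icon, etc.).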
This is an interesting UI proposal and, if done right, it might provide the needed transparency. Most people wouldn't read the side notes, but some would, especially for critical answers.