Why work on introspection?
We have a section on the motivation to study introspection (with the specific definition we use in the paper). https://arxiv.org/html/2410.13787v1#S7
If models are indeed capable of introspection, this brings both potential opportunities and risks.
An introspective model can answer questions about itself based on properties of its internal states—even when those answers are not inferable from its training data. This capability could be used to create honest models that accurately report their beliefs, world models, dispositions, and goals. It could also help us learn about the moral status of models. For example, we could simply ask a model if it is suffering, if it has unmet desires, and if it is being treated ethically. Currently, when models answer such questions, we presume their answers are an artifact of their training data.
However, introspection also has potential risks. Models that can introspect may have increased situational awareness and the ability to exploit this to get around human oversight. For instance, models may infer facts about how they are being evaluated and deployed by introspecting on the scope of their knowledge. An introspective model may also be capable of coordinating with other instances of itself without any external communication.
Beyond that, whether a cognitive system has special access to itself is a fundamental question, and one that we don’t understand well when it comes to language models. On the one hand, it is a fascinating question in itself; on the other, knowing more about the nature of LLMs is important when thinking about their safety and alignment.