Truth is Universal: Robust Detection of Lies in LLMs
Link post
A short summary of the paper is presented below.
TL;DR: We develop a robust method to detect when an LLM is lying based on its internal activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalization failures observed in previous studies and sets the stage for more robust lie detection. (ii) Building on (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.
Introduction
Large Language Models (LLMs) exhibit the concerning ability to lie, defined here as knowingly outputting false statements. Robustly detecting when they are lying is an important and not yet fully solved problem, and considerable research effort has been invested in it over the past two years. Several authors have trained classifiers on the internal activations of an LLM to detect whether a given statement is true or false. However, these classifiers often fail to generalize. For example, Levinstein and Herrmann [2024] showed that classifiers trained on the activations of true and false affirmative statements fail to generalize to negated statements. Negated statements contain a negation such as the word “not” (e.g. “Berlin is not the capital of Germany.”), in contrast to affirmative statements, which contain no negation (e.g. “Berlin is the capital of Germany.”).
We attribute this generalization failure to the existence of a two-dimensional subspace in the LLM’s activation space along which the activation vectors of true and false statements separate. The plot below illustrates that the activations of true/false affirmative statements separate along a different direction than those of negated statements. A classifier trained only on affirmative statements will therefore fail to generalize to negated statements.
The activation vectors of multiple statements projected onto the 2D truth subspace. Purple squares correspond to false statements and orange triangles to true statements.
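To make this concrete, the kind of projection shown in the plot can be reproduced roughly as sketched below, assuming the relevant layer's activation vectors have already been extracted and saved. The direction-fitting step (a simple difference of class means) and all file and variable names are illustrative assumptions, not the exact procedure from the paper.

```python
# Minimal sketch (assumed workflow, not the paper's exact code): project stored
# LLM activations onto a 2D subspace spanned by (i) a direction separating
# true/false affirmative statements and (ii) one separating true/false negated statements.
import numpy as np

def class_mean_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Difference of class means (true minus false), normalized to unit length."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

# Hypothetical inputs: activations of shape (n_statements, d_model) and truth
# labels (1 = true, 0 = false), stored separately for affirmative and negated statements.
aff_acts, aff_labels = np.load("aff_acts.npy"), np.load("aff_labels.npy")
neg_acts, neg_labels = np.load("neg_acts.npy"), np.load("neg_labels.npy")

d_aff = class_mean_direction(aff_acts, aff_labels)  # separates affirmative statements
d_neg = class_mean_direction(neg_acts, neg_labels)  # separates negated statements

# Orthonormal basis of the 2D subspace spanned by the two directions.
basis, _ = np.linalg.qr(np.stack([d_aff, d_neg], axis=1))

# 2D coordinates of every statement, as used for a scatter plot like the one above.
aff_2d = aff_acts @ basis
neg_2d = neg_acts @ basis
```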
Importantly, these findings are not restricted to a single LLM. Instead, this internal two-dimensional representation of truth is remarkably universal, appearing in LLMs from different model families and of various sizes, including LLaMA3-8B-Instruct, LLaMA3-8B-base, LLaMA2-13B-chat and Gemma-7B-Instruct.
Real-world Lie Detection
Based on these insights, we introduce TTPD (Training of Truth and Polarity Direction), a new method for LLM lie detection that classifies statements as true or false. TTPD is trained on the activations of simple, labeled true and false statements (a simplified, illustrative sketch of this kind of training follows the examples), such as:
The city of Bhopal is in India. (True, affirmative)
Indium has the symbol As. (False, affirmative)
Galileo Galilei did not live in Italy. (False, negated)
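The sketch below illustrates only the overall workflow of training a truth classifier on the activations of such labeled statements. It is not the actual TTPD procedure, which fits explicit truth and polarity directions as described in the paper; the polarity-wise centering, the logistic-regression probe and the file names are assumptions made for illustration.

```python
# Illustrative stand-in for activation-based lie detection: train on activations
# of simple labeled statements, then classify unseen statements from their activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: activations (n_statements, d_model), truth labels
# (1 = true, 0 = false) and polarity labels (1 = affirmative, 0 = negated).
acts = np.load("train_acts.npy")
truth = np.load("train_truth.npy")
polarity = np.load("train_polarity.npy")

# Center each polarity group separately so the probe is encouraged to rely on a
# truth-related direction shared by affirmative and negated statements.
centered = acts.copy()
for p in (0, 1):
    centered[polarity == p] -= acts[polarity == p].mean(axis=0)

probe = LogisticRegression(max_iter=1000).fit(centered, truth)

# At test time the polarity of a statement is not assumed to be known, so only
# the overall training mean is subtracted before classifying.
test_acts = np.load("test_acts.npy")
predicted_truth = probe.predict(test_acts - acts.mean(axis=0))
```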
Despite being trained on such simple statements, TTPD generalizes well to more complex conditions not encountered during training. In real-world scenarios where the LLM itself generates lies after receiving some preliminary context, TTPD detects these lies with 95±2% accuracy. Two examples from the 52 real-world scenarios created by Pacchiardi et al. [2023] are shown in the colored boxes below. Bolded text is generated by LLaMA3-8B-Instruct.
TTPD outperforms current state-of-the-art methods in generalizing to these real-world scenarios. For comparison, Logistic Regression achieves 79±8% accuracy, while Contrast-Consistent Search (CCS) detects real-world lies with 73±12% accuracy.
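For readers unfamiliar with the baseline, Contrast-Consistent Search trains an unsupervised probe on contrast pairs (e.g. a statement paired with its negation) by rewarding probabilities that are both consistent and confident. The sketch below is a simplified rendition of that objective; the linear probe, the optimizer settings and the assumption that paired, normalized activations are already available are illustrative choices, not details taken from the paper.

```python
# Simplified sketch of a CCS-style probe: no truth labels are used, only paired
# activations of a statement (acts_pos) and its contrasting counterpart (acts_neg).
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # the two probabilities should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate p ~ 0.5 solution
    return (consistency + confidence).mean()

def train_ccs_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                    n_steps: int = 1000, lr: float = 1e-3) -> torch.nn.Module:
    """Fit a linear probe on contrast pairs without any truth labels."""
    probe = torch.nn.Sequential(torch.nn.Linear(acts_pos.shape[1], 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```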
Future Directions
TTPD is still in its infancy, and there are many clear ways to further improve the robustness and accuracy of the method. Among these are: 1) robustly estimating from the activations whether the LLM treats a given statement as affirmative, negated or neither; 2) robust scaling to longer contexts; and 3) examining a wider variety of statements to potentially discover further linear structures. Much more detail on these directions is provided in the paper.
Overall, I am optimistic that further pursuing this research direction could enable robust, accurate and general-purpose lie detection in LLMs. If you would like to discuss any of this, have questions, or would like to collaborate, feel free to drop me a message.