The paper argues that there is one generalizing truth direction tG, which corresponds to whether a statement is true, and one polarity-sensitive truth direction tP, which corresponds to XOR(is_true, is_negated); this relates to Sam Marks’ work on LLMs representing XOR features. It further states that the truth directions for affirmative and negated statements are linear combinations of tG and tP, just with different coefficients.
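Spelling out my reading of that claim (schematic coefficients, not the paper’s exact values): since tP tracks XOR(is_true, is_negated), it aligns with truth on affirmative statements and flips sign on negated ones, so the two truth directions should look something like t_affirm ≈ a·tG + b·tP and t_neg ≈ a′·tG − b′·tP, with positive coefficients a, b, a′, b′.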
Is there evidence that tG is an actual, elementary feature used by the language model, and not a linear combination of other features? For example, I could imagine that tG is a linear combination of features like XOR(is_true, is_french), AND(is_true, is_end_of_sentence), etc.
Do you think we have reason to believe that tG is an elementary feature, and not a linear combination?
If tG is in fact such a combination, it seems to me that there is a high risk of the probe failing when the distribution changes (e.g. on French text in the example above), particularly with XOR features that change polarity under the shift.
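To make the worry concrete, here is a toy sketch (purely made-up directions and numbers, not anything measured from a real model) of how this could play out. A probe fit on English-only data cannot distinguish is_true from XOR(is_true, is_french), since the two coincide whenever is_french is False; a direction that mixes both then looks perfect in training and drops to chance on French statements.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical feature directions: "is_true" and "XOR(is_true, is_french)",
# made orthonormal so the cancellation below is exact.
d_true = unit(rng.standard_normal(dim))
d_xor = rng.standard_normal(dim)
d_xor = unit(d_xor - (d_xor @ d_true) * d_true)

def activation(is_true, is_french):
    """Toy activation: signed sum of the active feature directions plus noise."""
    xor = is_true != is_french
    return ((1 if is_true else -1) * d_true
            + (1 if xor else -1) * d_xor
            + 0.1 * rng.standard_normal(dim))

# On English-only training data, XOR(is_true, is_french) == is_true,
# so a probe cannot tell d_true apart from d_true + d_xor.
probe = d_true + d_xor

for is_french in (False, True):
    correct = [(activation(t, is_french) @ probe > 0) == t
               for t in (True, False) for _ in range(200)]
    print(f"is_french={is_french}: accuracy={np.mean(correct):.2f}")
```

Here the English accuracy comes out near 1.00 and the French accuracy near 0.50, because the tG-like and XOR-like contributions to the probe score cancel exactly once is_french flips the XOR.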
This is an excellent question! Indeed, we cannot rule out that tG is a linear combination or Boolean function of features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that tG generalizes robustly under several significant distribution shifts. Specifically, tG is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, tG generalizes reasonably well to (i) unseen topics, (ii) unseen statement types, (iii) real-world scenarios, and (iv) other final tokens like "!" or ".'". I think that the real-world scenarios (iii) are a particularly significant distribution shift. However, I agree with you that tests on many more distribution shifts are needed to be highly confident that tG is indeed an elementary feature (if something like that even exists).
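For concreteness, the evaluation protocol described above is essentially the following (a generic sketch, not the paper's actual code; the activations here are random placeholder data purely so the snippet runs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n = 64, 200
t_g = rng.standard_normal(dim)  # stand-in "ground truth" direction

def make_split(shift_noise):
    """Placeholder activations: label-aligned signal along t_g plus noise.
    Real use would load stored model activations for each statement set."""
    y = rng.integers(0, 2, n)
    X = np.outer(2 * y - 1, t_g) + shift_noise * rng.standard_normal((n, dim))
    return X, y

# Fit the probe on one narrow distribution (simple statements, "." endings)...
X_train, y_train = make_split(shift_noise=1.0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then report accuracy on shifted test sets (noise levels are arbitrary here).
for name, noise in [("unseen_topics", 2.0), ("real_world", 4.0), ("other_tokens", 2.0)]:
    X, y = make_split(noise)
    print(f"{name}: accuracy = {probe.score(X, y):.2f}")
```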