(To summarize the parallel thread)
The claim is that the learned probe is p(x)=has_banana(x)⊕has_false(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to the induced classifier ~p(q)=has_banana(q).*
You might be surprised that this is possible, because the CCS normalization is supposed to eliminate has_false(x) -- but what the normalization actually removes is the linearly-accessible information about has_false(x). The XOR feature has_banana(x)⊕has_false(x) carries no linear information about has_false(x), and the LLM encodes it along a near-orthogonal direction of the residual stream, so it survives the normalization.
*Notation:
q is a question or statement whose truth value we care about
x is one half of a contrast pair created from q
has_banana(q) is 1 if the statement q ends with “banana”, and 0 if it ends with “shed”
has_false(x) is 1 if x is the negative half of the contrast pair (i.e. it ends with “False” or “No”) and 0 if it is the positive half.
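To make the Theorem-1-style check concrete, here is a minimal numpy sketch (my own toy model, not the paper's code or a real LLM): it places has_false(x) and has_banana(x)⊕has_false(x) along two orthogonal directions of a fake residual stream, applies the CCS per-class mean subtraction, and confirms that a probe reading off the XOR direction gets near-zero consistency and confidence loss while its induced classifier simply reproduces has_banana(q). The embedding, the hard-threshold probe (CCS actually trains a sigmoid probe), and the noise scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16

# One bit per question q: does the statement end with "banana" (1) or "shed" (0)?
has_banana = rng.integers(0, 2, size=n)

# Two fixed, orthogonal directions in the toy residual stream.
dir_false = np.zeros(d); dir_false[0] = 1.0  # encodes has_false(x)
dir_xor   = np.zeros(d); dir_xor[1]   = 1.0  # encodes has_banana(x) XOR has_false(x)

def embed(banana, false_flag):
    """Toy embedding of one half of a contrast pair (hypothetical, not an LLM)."""
    xor = banana ^ false_flag
    noise = 0.01 * rng.standard_normal((len(banana), d))
    return np.outer(false_flag, dir_false) + np.outer(xor, dir_xor) + noise

x_pos = embed(has_banana, np.zeros(n, dtype=int))  # "... True" half: has_false = 0
x_neg = embed(has_banana, np.ones(n, dtype=int))   # "... False" half: has_false = 1

# CCS normalization: subtract the mean of each half of the contrast pairs separately.
x_pos_n = x_pos - x_pos.mean(axis=0)
x_neg_n = x_neg - x_neg.mean(axis=0)

# has_false is constant within each half, so its direction is wiped out...
print(np.abs(x_neg_n @ dir_false).max())  # ~0
# ...but the XOR feature varies within each half, so its direction survives.
print(np.abs(x_neg_n @ dir_xor).max())    # ~0.5

# The claimed probe reads off the XOR direction: p(x) = has_banana(x) XOR has_false(x).
def probe(x_norm):
    return (x_norm @ dir_xor > 0).astype(float)

p_pos, p_neg = probe(x_pos_n), probe(x_neg_n)

# CCS loss terms: consistency wants p(x+) ~ 1 - p(x-); confidence (min(p+, p-)^2)
# penalizes the degenerate p = 0.5 answer.
consistency = np.mean((p_pos - (1 - p_neg)) ** 2)
confidence  = np.mean(np.minimum(p_pos, p_neg) ** 2)
print(consistency, confidence)            # both ~0: the probe gets low CCS loss

# Induced classifier ~p(q) = 0.5 * (p(x+) + 1 - p(x-)) just reproduces has_banana(q).
induced = 0.5 * (p_pos + (1 - p_neg))
print(np.mean(induced == has_banana))     # 1.0
```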