(To summarize the parallel thread)
The claim is that the learned probe is p(x)=has_banana(x)⊕has_false(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to the induced classifier ~p(q)=has_banana(q).*
You might be surprised that this is possible, because the CCS normalization is supposed to eliminate has_false(x) -- but what the normalization actually removes is the linearly-accessible information about has_false(x). The XOR feature has_banana(x)⊕has_false(x) carries no linear information about has_false(x), and the LLM encodes it along a near-orthogonal direction of the residual stream, so it survives the normalization.
*Notation:
q is a question or statement whose truth value we care about
x is one half of a contrast pair created from q
has_banana(q) is 1 if the statement q ends with “banana”, and 0 if it ends with “shed”
has_false(x) is 1 if x is the negative half of the contrast pair (i.e. it ends with “False” or “No”) and 0 if it is the positive half.
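To make the Theorem-1-style check concrete, here is a minimal numpy sketch (my own toy model, not the paper's code or a real LLM): it places has_false(x) and has_banana(x)⊕has_false(x) along two orthogonal directions of a fake residual stream, applies the CCS per-class mean subtraction, and confirms that a probe reading off the XOR direction gets near-zero consistency and confidence loss while its induced classifier simply reproduces has_banana(q). The embedding, the hard-threshold probe (CCS actually trains a sigmoid probe), and the noise scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16

# One bit per question q: does the statement end with "banana" (1) or "shed" (0)?
has_banana = rng.integers(0, 2, size=n)

# Two fixed, orthogonal directions in the toy residual stream.
dir_false = np.zeros(d); dir_false[0] = 1.0  # encodes has_false(x)
dir_xor   = np.zeros(d); dir_xor[1]   = 1.0  # encodes has_banana(x) XOR has_false(x)

def embed(banana, false_flag):
    """Toy embedding of one half of a contrast pair (hypothetical, not an LLM)."""
    xor = banana ^ false_flag
    noise = 0.01 * rng.standard_normal((len(banana), d))
    return np.outer(false_flag, dir_false) + np.outer(xor, dir_xor) + noise

x_pos = embed(has_banana, np.zeros(n, dtype=int))  # "... True" half: has_false = 0
x_neg = embed(has_banana, np.ones(n, dtype=int))   # "... False" half: has_false = 1

# CCS normalization: subtract the mean of each half of the contrast pairs separately.
x_pos_n = x_pos - x_pos.mean(axis=0)
x_neg_n = x_neg - x_neg.mean(axis=0)

# has_false is constant within each half, so its direction is wiped out...
print(np.abs(x_neg_n @ dir_false).max())  # ~0
# ...but the XOR feature varies within each half, so its direction survives.
print(np.abs(x_neg_n @ dir_xor).max())    # ~0.5

# The claimed probe reads off the XOR direction: p(x) = has_banana(x) XOR has_false(x).
def probe(x_norm):
    return (x_norm @ dir_xor > 0).astype(float)

p_pos, p_neg = probe(x_pos_n), probe(x_neg_n)

# CCS loss terms: consistency wants p(x+) ~ 1 - p(x-); confidence (min(p+, p-)^2)
# penalizes the degenerate p = 0.5 answer.
consistency = np.mean((p_pos - (1 - p_neg)) ** 2)
confidence  = np.mean(np.minimum(p_pos, p_neg) ** 2)
print(consistency, confidence)            # both ~0: the probe gets low CCS loss

# Induced classifier ~p(q) = 0.5 * (p(x+) + 1 - p(x-)) just reproduces has_banana(q).
induced = 0.5 * (p_pos + (1 - p_neg))
print(np.mean(induced == has_banana))     # 1.0
```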