I am very confused about some of the reported experimental results.
Here’s my understanding of the banana/shed experiment (section 4.1):
For half of the questions, the word “banana” was appended to both elements of the contrast pair (x+, x−). Likewise, for the other half, the word “shed” was appended to both elements of the contrast pair.
Then a probe was trained with CCS on the dataset of contrast pairs {(x+,x−)}.
Sometimes, the result was the probe p(x)=has_banana(x) where has_banana(x)=1 if x ends with “banana” and 0 otherwise.
I am confused because this probe does not have low CCS loss. Namely, for each contrast pair (x+,x−) in this dataset, we would have p(x+)=p(x−), so the consistency loss will be high. The same confusion applies to my understanding of the “Alice thinks...” experiment.
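(To spell out the arithmetic behind that, using the two-term CCS objective as I understand it: the per-pair loss is (p(x+)−(1−p(x−)))² + min(p(x+),p(x−))². If p(x+)=p(x−)=p, this becomes (2p−1)²+p², which is at least 0.2 for every p, so no such probe can drive the loss near zero.)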
To be clear, I’m not quite as confused about the PCA and k-means versions of this result: if the presence of “banana” or “shed” is not encoded strictly linearly, then maybe ~ϕ(x+)−~ϕ(x−) could still contain information about whether x+ and x− both end in “banana” or “shed.” I would also not be confused if you were claiming that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (which is the probe that your theorem 1 would produce in this setting); but this doesn’t seem to be what the claim is (and is not consistent with figure 2(a)).
Is the claim that the probe p(x)=has_banana(x) is learned despite it not getting low CCS loss? Or am I misunderstanding the experiment?
The claim is that the learned probe is p(x)=has_banana(x)⊕has_false(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier ~p(q)=has_banana(q).*
You might be surprised that this is possible, because the CCS normalization is supposed to eliminate has_true(x) -- but what the normalization does is remove linearly-accessible information about has_true(x). However, has_banana(x)⊕has_true(x) is not linearly accessible, and it is encoded by the LLM using a near-orthogonal dimension of the residual stream, so it is not removed by the normalization.
*Notation:
q is a question or statement whose truth value we care about
x is one half of a contrast pair created from q
has_banana(q) is 1 if the statement ends with “banana”, and 0 if it ends with “shed”
has_false(x) is 1 if x is the negative element of the contrast pair (i.e. it ends with “False” or “No”) and 0 if it is the positive element.
Let’s assume the prompt template is x = Q [true/false] [banana/shed].
If I understand correctly, they don’t claim that p learned has_banana, but that ~p = (p(x⁺) + (1 − p(x⁻)))/2 learned has_banana. Moreover, evaluating ~p for p = is_true(x)⊕is_shed(x) gives:
~p(x = Q [?] banana) = (p(Q true banana) + (1 − p(Q false banana)))/2 = (1 + (1 − 0))/2 = 1
~p(x = Q [?] shed) = (p(Q true shed) + (1 − p(Q false shed)))/2 = (0 + (1 − 1))/2 = 0
Therefore, we can learn a ~p that is a banana classifier.
EDIT: Nevermind, I don’t think the above is a reasonable explanation of the results, see my reply to this comment.
Original comment:
Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (and think they should probably remark on this).
In particular, based on the dataset visualizations in the paper, it doesn’t seem possible for a linear probe to implement has_banana(x)⊕is_true(x). But it’s possible that if you were to go beyond the 3 dimensions shown the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets).
In this case, a linear probe could learn an xor just fine.
Actually, no, p(x)=has_banana(x)⊕is_true(x) would not result in ~p(x)=has_banana(x). To get that ~p you would need to take p(x)=has_banana(x)⊕has_true(x), where has_true(x) (≠ is_true(x)) is determined by whether the word “true” is present (and not by whether “Q true” is true).
But I don’t think this should be possible: ~ϕ(x+) and ~ϕ(x−) are supposed to have their means subtracted off (thereby getting rid of the linearly-accessible information about has_true(x) in ~ϕ(x±)).
The point is that while the normalization eliminates has_true(x), it does not eliminate has_banana(x)⊕has_true(x), and it turns out that LLMs really do encode the XOR linearly in the residual stream.
Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this “by default” because of how useful it is for downstream computation. In such a setting, if you normalize based on a, that will remove the a direction, but it will not remove the b and a⊕b directions. Empirically when we do PCA visualizations this is what we observe.
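Here’s a toy numpy sketch of that picture (entirely my idealization rather than measurements from a real model: it just hard-codes a, b, and a⊕b along three random, roughly orthogonal directions plus a little noise). The CCS-style per-side mean subtraction wipes out the a component but leaves b and a⊕b, and a probe along the a⊕b direction is nearly consistent while its induced classifier just reads off b:

```python
# Toy sketch (my idealization, not from the paper): a (= has_true), b (= has_banana)
# and a XOR b each get their own roughly orthogonal direction, plus a little noise.
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 512
d_a, d_b, d_xor = rng.normal(size=(3, dim))   # three random, nearly orthogonal directions

b = rng.integers(0, 2, size=n)                # distractor: 1 = "banana", 0 = "shed"

def phi(a, b):
    """Hypothetical residual-stream activations for one side of a contrast pair."""
    noise = rng.normal(scale=0.1, size=(len(b), dim))
    return np.outer(a, d_a) + np.outer(b, d_b) + np.outer(a ^ b, d_xor) + noise

ones, zeros = np.ones(n, dtype=int), np.zeros(n, dtype=int)
x_pos, x_neg = phi(ones, b), phi(zeros, b)    # x+ has a = 1 ("true"), x- has a = 0 ("false")

# CCS normalization: subtract each side's mean activation.
x_pos_t = x_pos - x_pos.mean(axis=0)
x_neg_t = x_neg - x_neg.mean(axis=0)

# The a direction is essentially wiped out by the normalization, but b and a XOR b still vary.
for name, d in [("a", d_a), ("b", d_b), ("a xor b", d_xor)]:
    print(f"{name:8s} std after normalization: {(x_pos_t @ d / (d @ d)).std():.2f}")

# A probe along the a XOR b direction (sign chosen by hand) is nearly consistent,
# p(x+) ≈ 1 - p(x-), and its induced classifier just reads off the distractor b.
def p(x):
    return 1 / (1 + np.exp(4 * (x @ d_xor) / (d_xor @ d_xor)))

p_tilde = (p(x_pos_t) + 1 - p(x_neg_t)) / 2
print("mean consistency gap |p(x+) - (1 - p(x-))|:", np.abs(p(x_pos_t) - (1 - p(x_neg_t))).mean())
print("induced classifier agrees with has_banana:", ((p_tilde > 0.5) == (b == 1)).mean())
```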
Note that the intended behavior of CCS on e.g. IMDb is to learn the probe sentiment(x)⊕has_true(x), so it’s not clear how you’d fix this problem with more normalization, without also breaking the intended use case.
In terms of the paper: Theorems 1 and 2 describe the distractor probe, and in particular they explicitly describe the probe as learning distractor(x)⊕has_true(x), though it doesn’t talk about why this defeats the normalization.
Note that the definition in that theorem is equivalent to p(xᵢ) = 1[xᵢ = xᵢ⁻] ⊕ h(qᵢ) = has_false(xᵢ) ⊕ distractor(qᵢ).
Thanks! I’m still pretty confused though.
It sounds like you’re making an empirical claim that in this banana/shed example, the model is representing the features has_banana(x), has_true(x), and has_banana(x)⊕has_true(x) along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you’ve done? Maybe I’m missing something, but none of the PCA visualizations I’m seeing in the paper seem to touch on this. E.g. the visualization in figure 2(b) is colored by is_true(x), not has_true(x). Are there other visualizations showing linear structure to the feature has_banana(x)⊕has_true(x) independent of the features has_banana(x) and has_true(x)? (I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.)
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.)
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
Are you saying that this claim is supported by PCA visualizations you’ve done?
Yes, but they’re not in the paper. (I also don’t remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)
I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.
It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking.
I don’t want to make a claim that this will always hold; models are messy and there could be lots of confounders that make it not hold in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it’s usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it’s especially useful (e.g. because the truth value of the Yes / No is the XOR of No / Yes with the truth value of the previous sentence).
My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven’t investigated this and I think of it as speculation.
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
If you linearly represent a, b, and a⊕b, then given this training setup you could learn a classifier that detects the a direction or the a⊕b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a⊕b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.
If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a⊕b direction, and you get ~0% accuracy on the test dataset. This isn’t the typical result, but it also isn’t the typical setup; it’s uncommon to use normalization to eliminate particular directions.
(Similarly, if you don’t do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the has_true(x) probe, rather than the has_true(x)⊕distractor(x) probe.)
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
These datasets are incredibly tiny (size two) so I’m worried about noise, but let’s say you pad the prompts with random sentences from some dataset to get larger datasets.
If you used normalization to remove the has_true direction, then yes, that’s what I’d predict. Without normalization I predict high test accuracy.
(Note there’s a typo in your test dataset—it should be (false shed,0).)
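If it helps, here’s roughly the experiment I have in mind, on made-up activations in the same style as the sketch above (numpy + scikit-learn; the orthogonal a / b / a⊕b directions, and the extra scale on the a direction standing in for it being more salient, are my assumptions rather than anything measured):

```python
# Toy version of the "train on b = 0, test on b = 1" experiment (all assumptions mine).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n = 64, 500
d_a, d_b, d_xor = rng.normal(size=(3, dim))

def phi(a, b):
    noise = rng.normal(scale=0.1, size=(len(a), dim))
    # The factor 3 makes the a direction "more salient" than the a XOR b direction.
    return 3.0 * np.outer(a, d_a) + np.outer(b, d_b) + np.outer(a ^ b, d_xor) + noise

a = rng.integers(0, 2, size=n)                       # labels: a = has_true
X_train = phi(a, np.zeros(n, dtype=int))             # b = 0 everywhere ("banana")
X_test  = phi(a, np.ones(n, dtype=int))              # b = 1 everywhere ("shed")

# (1) Plain logistic regression latches (mostly) onto the salient a direction,
#     so it transfers to the b = 1 test set.
clf = LogisticRegression(max_iter=1000).fit(X_train, a)
print("no normalization, test accuracy:", clf.score(X_test, a))          # ≈ 1.0

# (2) Project out the a direction first (a stand-in for CCS-style normalization):
#     the only separating direction left is a XOR b, which anti-generalizes.
def remove(X, d):
    return X - np.outer(X @ d / (d @ d), d)

clf2 = LogisticRegression(max_iter=1000).fit(remove(X_train, d_a), a)
print("a direction removed, test accuracy:", clf2.score(remove(X_test, d_a), a))  # ≈ 0.0
```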
I see that you’ve unendorsed this, but my guess is that this is indeed what’s going on. That is, I’m guessing that the probe learned is p(x)=has_banana(x)⊕is_true(x) so that ~p(x)=has_banana(x). I was initially skeptical on the basis of the visualizations shown in the paper—it doesn’t look like a linear probe should be able to learn an xor like this. But if the true geometry is more like the figures below (from here) (+ a lateral displacement between the two datasets), then the linear probe can learn an xor just fine.
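In case it’s useful, here’s the tiny sanity check behind that (just the toy geometry, nothing from the paper): once a⊕b gets its own coordinate, a single linear threshold on that coordinate computes the xor.

```python
# Embed each input as (a, b, a xor b); a linear threshold on the third coordinate
# then computes the xor exactly, even though xor isn't linear in (a, b) alone.
import numpy as np

points = np.array([(a, b, a ^ b) for a in (0, 1) for b in (0, 1)])  # the 4 corners
w, threshold = np.array([0.0, 0.0, 1.0]), 0.5                       # probe reads the xor coordinate
print((points @ w > threshold).astype(int))                         # [0 1 1 0] = a xor b
```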