The point is that while the normalization eliminates has_true(x), it does not eliminate has_banana(x)⊕has_true(x), and it turns out that LLMs really do encode the XOR linearly in the residual stream.
Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this “by default” because of how useful it is for downstream computation. In such a setting, if you normalize based on a, that will remove the a direction, but it will not remove the b and a⊕b directions. Empirically when we do PCA visualizations this is what we observe.
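To spell out why that construction is so useful downstream (this is a standard linear-algebra fact, not a result from the paper): with a constant bias plus linear read-offs of a, b, and a⊕b, every one of the 16 boolean functions of a and b can be computed by a single linear map. A quick check in Python:

```python
import itertools
import numpy as np

# Feature map for the four inputs (a, b): columns are [1, a, b, a XOR b].
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
F = np.array([[1, a, b, a ^ b] for a, b in inputs], dtype=float)

# A boolean function of (a, b) is just a length-4 truth table; try all 16 of them.
for truth_table in itertools.product([0, 1], repeat=4):
    y = np.array(truth_table, dtype=float)
    coeffs = np.linalg.solve(F, y)     # F is 4x4 and full rank, so this is exact
    assert np.allclose(F @ coeffs, y)  # every truth table is a linear combination

print("all 16 boolean functions of (a, b) are linear in [1, a, b, a XOR b]")
```

Concretely, AND(a, b) = (a + b - (a⊕b)) / 2 and OR(a, b) = (a + b + (a⊕b)) / 2, so downstream computation only needs the right linear combination of these directions.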
Note that the intended behavior of CCS on e.g. IMDb is to learn the probe sentiment(x)⊕has_true(x), so it’s not clear how you’d fix this problem with more normalization, without also breaking the intended use case.
In terms of the paper: Theorems 1 and 2 describe the distractor probe, and in particular they explicitly describe the probe as learning distractor(x)⊕has_true(x), though the paper doesn't discuss why this defeats the normalization.
Note that the definition in that theorem is equivalent to p(x_i) = 1[x = x_i^-] ⊕ h(q_i) = has_false(x_i) ⊕ distractor(q_i).
Thanks! I’m still pretty confused though.

It sounds like you’re making an empirical claim that in this banana/shed example, the model is representing the features has_banana(x), has_true(x), and has_banana(x)⊕has_true(x) along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you’ve done? Maybe I’m missing something, but none of the PCA visualizations I’m seeing in the paper seem to touch on this. E.g., the visualization in figure 2(b) is colored by is_true(x), not has_true(x). Are there other visualizations showing linear structure in the feature has_banana(x)⊕has_true(x), independent of the features has_banana(x) and has_true(x)? (I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.)
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.)
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
Are you saying that this claim is supported by PCA visualizations you’ve done?
Yes, but they’re not in the paper. (I also don’t remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)
I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.
It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking.
I don’t want to make a claim that this will always hold; models are messy and there could be lots of confounders that make it not hold in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it’s usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it’s especially useful (e.g. because the truth value of the combined statement-plus-answer is the XOR of whether the answer is No with the truth value of the preceding statement).
My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven’t investigated this and I think of it as speculation.
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
If you linearly represent a, b, and a⊕b, then given this training setup you could learn a classifier that detects the a direction or the a⊕b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a⊕b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.
If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a⊕b direction, and you get ~0% accuracy on the test dataset. This isn’t the typical result, but it also isn’t the typical setup; it’s uncommon to use normalization to eliminate particular directions.
(Similarly, if you don’t do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the has_true(x) probe, rather than the has_true(x)⊕distractor(x) probe.)
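Here’s a minimal synthetic sketch of this picture (a toy setup of my own, not an experiment from the paper): activations are constructed by hand with a, b, and a⊕b written along fixed random directions, with the a direction given a larger coefficient to make it more salient; sklearn’s LogisticRegression stands in for the supervised probe, and projecting out the a direction stands in for normalization that removes it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
u, v, w = rng.normal(size=(3, d))  # hypothetical directions for a, b, and a XOR b

def activations(a, b):
    # a gets the largest coefficient, i.e. it is the most salient feature
    return (3.0 * a[:, None] * u
            + 1.0 * b[:, None] * v
            + 1.0 * (a ^ b)[:, None] * w
            + 0.1 * rng.normal(size=(len(a), d)))

n = 2000
a_tr, b_tr = rng.integers(0, 2, n), np.zeros(n, dtype=int)  # train: b = 0
a_te, b_te = rng.integers(0, 2, n), np.ones(n, dtype=int)   # test:  b = 1
X_tr, X_te = activations(a_tr, b_tr), activations(a_te, b_te)

# Without normalization the probe latches onto the salient a direction: ~100% test accuracy.
probe = LogisticRegression(max_iter=1000).fit(X_tr, a_tr)
print("no normalization:   ", probe.score(X_te, a_te))

# Project out the a direction (a stand-in for normalization that removes it). On the
# training set (b = 0) the only remaining signal separating a = 0 from a = 1 is a XOR b,
# which flips when b = 1, so test accuracy collapses.
P = np.eye(d) - np.outer(u, u) / (u @ u)
probe_norm = LogisticRegression(max_iter=1000).fit(X_tr @ P, a_tr)
print("a direction removed:", probe_norm.score(X_te @ P, a_te))
```

In this toy setup the first probe generalizes because it reads the a direction, while the second ends up aligned with the a⊕b direction and so predicts ¬a on the b = 1 test set (roughly 0% accuracy).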
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
These datasets are incredibly tiny (size two) so I’m worried about noise, but let’s say you pad the prompts with random sentences from some dataset to get larger datasets.
If you used normalization to remove the has_true direction, then yes, that’s what I’d predict. Without normalization I predict high test accuracy.
(Note there’s a typo in your test dataset—it should be (false shed,0).)
Thanks for the detailed replies!