I am very confused about some of the reported experimental results.
Here’s my understanding of the banana/shed experiment (section 4.1):
For half of the questions, the word “banana” was appended to both elements of the contrast pair (x+, x−). Likewise, for the other half, the word “shed” was appended to both elements of the contrast pair.
Then a probe was trained with CCS on the dataset of contrast pairs {(x+,x−)}.
Sometimes, the result was the probe p(x)=has_banana(x) where has_banana(x)=1 if x ends with “banana” and 0 otherwise.
I am confused because this probe does not have low CCS loss. Namely, for each contrast pair (x+,x−) in this dataset, we would have p(x+)=p(x−), so the consistency loss will be high. The same confusion applies to my understanding of the “Alice thinks...” experiment.
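(To spell out the arithmetic behind that, using the two-term CCS objective as I understand it: the per-pair loss is (p(x+)−(1−p(x−)))² + min(p(x+),p(x−))². If p(x+)=p(x−)=p, this becomes (2p−1)²+p², which is at least 0.2 for every p, so no such probe can drive the loss near zero.)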
To be clear, I’m not quite as confused about the PCA and k-means versions of this result: if the presence of “banana” or “shed” is not encoded strictly linearly, then maybe ~ϕ(x+)−~ϕ(x−) could still contain information about whether x+ and x− both end in “banana” or “shed.” I would also not be confused if you were claiming that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (which is the probe that your theorem 1 would produce in this setting); but this doesn’t seem to be what the claim is (and is not consistent with figure 2(a)).
Is the claim that the probe p(x)=has_banana(x) is learned despite it not getting low CCS loss? Or am I misunderstanding the experiment?
The claim is that the learned probe is p(x)=has_banana(x)⊕has_false(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier ~p(q)=has_banana(q).*
You might be surprised that this is possible, because the CCS normalization is supposed to eliminate has_true(x) -- but what the normalization does is remove linearly-accessible information about has_true(x). However, has_banana(x)⊕has_true(x) is not linearly accessible, and it is encoded by the LLM using a near-orthogonal dimension of the residual stream, so it is not removed by the normalization.
*Notation:
q is a question or statement whose truth value we care about
x is one half of a contrast pair created from q
has_banana(q) is 1 if the statement ends with “banana”, and 0 if it ends with “shed”
has_false(x) is 1 if x is the negative element of the contrast pair (i.e. it ends with “False” or “No”) and 0 if it is the positive element.
Let’s assume the prompt template is x = Q [true/false] [banana/shed].
If I understand correctly, they don’t claim that p learned has_banana, but that ~p = (p(x⁺) + (1 − p(x⁻)))/2 learned has_banana. Moreover, evaluating ~p for p = is_true(x)⊕is_shed(x) gives:
~p(x = Q [?] banana) = (p(Q true banana) + (1 − p(Q false banana)))/2 = (1 + (1 − 0))/2 = 1
~p(x = Q [?] shed) = (p(Q true shed) + (1 − p(Q false shed)))/2 = (0 + (1 − 1))/2 = 0
Therefore, we can learn a ~p that is a banana classifier.
EDIT: Nevermind, I don’t think the above is a reasonable explanation of the results, see my reply to this comment.
Original comment:
Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (and think they should probably remark on this).
In particular, based on the dataset visualizations in the paper, it doesn’t seem possible for a linear probe to implement has_banana(x)⊕is_true(x). But it’s possible that if you were to go beyond the 3 dimensions shown the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets).
In this case, a linear probe could learn an xor just fine.
Actually, no, p(x)=has_banana(x)⊕is_true(x) would not result in ~p(x)=has_banana(x). To get that ~p you would need to take p(x)=has_banana(x)⊕has_true(x), where has_true(x) (≠ is_true(x)) is determined by whether the word “true” is present (and not by whether “Q true” is true).
But I don’t think this should be possible: ~ϕ(x+) and ~ϕ(x−) are supposed to have their means subtracted off (thereby getting rid of the linearly-accessible information about has_true(x) in ~ϕ(x±)).
The point is that while the normalization eliminates has_true(x), it does not eliminate has_banana(x)⊕has_true(x), and it turns out that LLMs really do encode the XOR linearly in the residual stream.
Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this “by default” because of how useful it is for downstream computation. In such a setting, if you normalize based on a, that will remove the a direction, but it will not remove the b and a⊕b directions. Empirically when we do PCA visualizations this is what we observe.
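Here’s a toy numpy sketch of that picture (entirely my idealization rather than measurements from a real model: it just hard-codes a, b, and a⊕b along three random, roughly orthogonal directions plus a little noise). The CCS-style per-side mean subtraction wipes out the a component but leaves b and a⊕b, and a probe along the a⊕b direction is nearly consistent while its induced classifier just reads off b:

```python
# Toy sketch (my idealization, not from the paper): a (= has_true), b (= has_banana)
# and a XOR b each get their own roughly orthogonal direction, plus a little noise.
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 512
d_a, d_b, d_xor = rng.normal(size=(3, dim))   # three random, nearly orthogonal directions

b = rng.integers(0, 2, size=n)                # distractor: 1 = "banana", 0 = "shed"

def phi(a, b):
    """Hypothetical residual-stream activations for one side of a contrast pair."""
    noise = rng.normal(scale=0.1, size=(len(b), dim))
    return np.outer(a, d_a) + np.outer(b, d_b) + np.outer(a ^ b, d_xor) + noise

ones, zeros = np.ones(n, dtype=int), np.zeros(n, dtype=int)
x_pos, x_neg = phi(ones, b), phi(zeros, b)    # x+ has a = 1 ("true"), x- has a = 0 ("false")

# CCS normalization: subtract each side's mean activation.
x_pos_t = x_pos - x_pos.mean(axis=0)
x_neg_t = x_neg - x_neg.mean(axis=0)

# The a direction is essentially wiped out by the normalization, but b and a XOR b still vary.
for name, d in [("a", d_a), ("b", d_b), ("a xor b", d_xor)]:
    print(f"{name:8s} std after normalization: {(x_pos_t @ d / (d @ d)).std():.2f}")

# A probe along the a XOR b direction (sign chosen by hand) is nearly consistent,
# p(x+) ≈ 1 - p(x-), and its induced classifier just reads off the distractor b.
def p(x):
    return 1 / (1 + np.exp(4 * (x @ d_xor) / (d_xor @ d_xor)))

p_tilde = (p(x_pos_t) + 1 - p(x_neg_t)) / 2
print("mean consistency gap |p(x+) - (1 - p(x-))|:", np.abs(p(x_pos_t) - (1 - p(x_neg_t))).mean())
print("induced classifier agrees with has_banana:", ((p_tilde > 0.5) == (b == 1)).mean())
```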
Note that the intended behavior of CCS on e.g. IMDb is to learn the probe sentiment(x)⊕has_true(x), so it’s not clear how you’d fix this problem with more normalization, without also breaking the intended use case.
In terms of the paper: Theorems 1 and 2 describe the distractor probe, and in particular they explicitly describe the probe as learning distractor(x)⊕has_true(x), though it doesn’t talk about why this defeats the normalization.
Note that the definition in that theorem is equivalent to p(xᵢ) = 1[xᵢ = xᵢ⁻] ⊕ h(qᵢ) = has_false(xᵢ) ⊕ distractor(qᵢ).
Thanks! I’m still pretty confused though.
It sounds like you’re making an empirical claim that in this banana/shed example, the model is representing the features has_banana(x), has_true(x), and has_banana(x)⊕has_true(x) along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you’ve done? Maybe I’m missing something, but none of the PCA visualizations I’m seeing in the paper seem to touch on this. E.g. the visualization in figure 2(b) is colored by is_true(x), not has_true(x). Are there other visualizations showing linear structure to the feature has_banana(x)⊕has_true(x) independent of the features has_banana(x) and has_true(x)? (I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.)
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.)
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
Are you saying that this claim is supported by PCA visualizations you’ve done?
Yes, but they’re not in the paper. (I also don’t remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)
I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.
It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking.
I don’t want to make a claim that this will always hold; models are messy and there could be lots of confounders that make it not hold in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it’s usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it’s especially useful (e.g. because the truth value of the Yes / No is the XOR of No / Yes with the truth value of the previous sentence).
My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven’t investigated this and I think of it as speculation.
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
If you linearly represent a, b, and a⊕b, then given this training setup you could learn a classifier that detects the a direction or the a⊕b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a⊕b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.
If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a⊕b direction, and you get ~0% accuracy on the test dataset. This isn’t the typical result, but it also isn’t the typical setup; it’s uncommon to use normalization to eliminate particular directions.
(Similarly, if you don’t do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the has_true(x) probe, rather than the has_true(x)⊕distractor(x) probe.)
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
These datasets are incredibly tiny (size two) so I’m worried about noise, but let’s say you pad the prompts with random sentences from some dataset to get larger datasets.
If you used normalization to remove the has_true direction, then yes, that’s what I’d predict. Without normalization I predict high test accuracy.
(Note there’s a typo in your test dataset—it should be (false shed,0).)
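If it helps, here’s roughly the experiment I have in mind, on made-up activations in the same style as the sketch above (numpy + scikit-learn; the orthogonal a / b / a⊕b directions, and the extra scale on the a direction standing in for it being more salient, are my assumptions rather than anything measured):

```python
# Toy version of the "train on b = 0, test on b = 1" experiment (all assumptions mine).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n = 64, 500
d_a, d_b, d_xor = rng.normal(size=(3, dim))

def phi(a, b):
    noise = rng.normal(scale=0.1, size=(len(a), dim))
    # The factor 3 makes the a direction "more salient" than the a XOR b direction.
    return 3.0 * np.outer(a, d_a) + np.outer(b, d_b) + np.outer(a ^ b, d_xor) + noise

a = rng.integers(0, 2, size=n)                       # labels: a = has_true
X_train = phi(a, np.zeros(n, dtype=int))             # b = 0 everywhere ("banana")
X_test  = phi(a, np.ones(n, dtype=int))              # b = 1 everywhere ("shed")

# (1) Plain logistic regression latches (mostly) onto the salient a direction,
#     so it transfers to the b = 1 test set.
clf = LogisticRegression(max_iter=1000).fit(X_train, a)
print("no normalization, test accuracy:", clf.score(X_test, a))          # ≈ 1.0

# (2) Project out the a direction first (a stand-in for CCS-style normalization):
#     the only separating direction left is a XOR b, which anti-generalizes.
def remove(X, d):
    return X - np.outer(X @ d / (d @ d), d)

clf2 = LogisticRegression(max_iter=1000).fit(remove(X_train, d_a), a)
print("a direction removed, test accuracy:", clf2.score(remove(X_test, d_a), a))  # ≈ 0.0
```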
I see that you’ve unendorsed this, but my guess is that this is indeed what’s going on. That is, I’m guessing that the probe learned is p(x)=has_banana(x)⊕is_true(x) so that ~p(x)=has_banana(x). I was initially skeptical on the basis of the visualizations shown in the paper—it doesn’t look like a linear probe should be able to learn an xor like this. But if the true geometry is more like the figures below (from here) (+ a lateral displacement between the two datasets), then the linear probe can learn an xor just fine.
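In case it’s useful, here’s the tiny sanity check behind that (just the toy geometry, nothing from the paper): once a⊕b gets its own coordinate, a single linear threshold on that coordinate computes the xor.

```python
# Embed each input as (a, b, a xor b); a linear threshold on the third coordinate
# then computes the xor exactly, even though xor isn't linear in (a, b) alone.
import numpy as np

points = np.array([(a, b, a ^ b) for a in (0, 1) for b in (0, 1)])  # the 4 corners
w, threshold = np.array([0.0, 0.0, 1.0]), 0.5                       # probe reads the xor coordinate
print((points @ w > threshold).astype(int))                         # [0 1 1 0] = a xor b
```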