I’m having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions.
Here’s a summary of your post as I understand it:
In Anthropic’s Toy Model of Attention Head “Superposition”,[1] they consider a task where the model needs to use interference between heads to implement multiple skip trigrams. In particular, they call this task “OV-incoherent”, because the OV seems to need to use information “not present” in the V of the source token. (This was incorrect, because you can implement their task perfectly using only a copying head and a negative copying head.) They call this attention head superposition because they didn’t understand the algorithm, and so mistakenly thought they needed more attention heads than you actually need to implement the task (to their credit, they point out their mistake in their July 2023 update, and give the two-head construction).
In this work, you propose a model of “OV-coherent” superposition, where the OV still needs to use information “not present” at the attended-to location, and which also requires more skip trigrams than attention heads to implement. Namely, you consider learning sequences of the form [A] … [B] … [Readoff] -> [C], which cannot naturally be implemented via skip trigrams (and instead need to be implemented via what Neel calls hierarchical skip trigrams, or what I normally call just “interference”).
You construct your sequences as follows:
There are 12 tokens for the input and 10 “output tokens”. Presumably you parameterized it so that d_vocab = 12, and just reassigned the inputs? For the input sequence, you use one token [0] as the read-off token, 4 tokens [1-4] as signal tokens, and the rest [5-11] as noise tokens.
In general, you don’t bother training the model above any tokens except the read-off [0] (I think it’s more likely you trained it to be uniform on the other tokens, actually. But at most this just rescales the OV and QK circuits (EVOU and EQKE respectively), and so we can ignore it when analyzing the attention heads).
Above the read-off, you train the model to minimize cross entropy loss, using the labels:
0 → 1, 1 present in sequence
1 → 1, 2 present in sequence
...
8 → 3, 4 present in sequence
9 → 4, 4 present in sequence
So for example, if you see the sequence [5] [1] [2] [11] [0], the model should assign a high logit to [1]; if you see the sequence [7] [4] [10] [3] [0], it should assign a high logit to [8]; etc.
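For concreteness, here’s a minimal sketch of how I’m imagining the data generation (my own reconstruction, not your code; the sequence length and the sampling distribution over pairs/noise are assumptions on my part):

```python
# My reconstruction of the task as described above (not the authors' code).
# Sequence length and the sampling distribution over pairs/noise are assumptions.
import random
from itertools import combinations_with_replacement

READOFF = 0                        # read-off token
SIGNAL = [1, 2, 3, 4]              # signal tokens
NOISE = list(range(5, 12))         # noise tokens

# Output label i <-> i-th unordered pair of signal tokens:
# 0:(1,1) 1:(1,2) 2:(1,3) 3:(1,4) 4:(2,2) 5:(2,3) 6:(2,4) 7:(3,3) 8:(3,4) 9:(4,4)
PAIR_TO_LABEL = {pair: i for i, pair in
                 enumerate(combinations_with_replacement(SIGNAL, 2))}

def make_example(seq_len=5):
    """Sample a sequence [A] ... [B] ... [Readoff] and its target label."""
    a, b = random.choice(list(PAIR_TO_LABEL))            # the two signal tokens (may repeat)
    body = [a, b] + random.choices(NOISE, k=seq_len - 3)
    random.shuffle(body)
    return body + [READOFF], PAIR_TO_LABEL[(a, b)]

# e.g. make_example() -> ([7, 4, 10, 3, 0], 8): signal tokens 3 and 4 present, so label 8.
```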
You find that models can indeed learn your sequences of the form [A] … [B] … [Readoff]-> [C], often by implementing constructive interference “between skip-trigrams” both between the two attention heads and within each single head.
Specifically, in your mainline model in the post, head 1 implements something like the following algorithm:
Attend to tokens in [1-4], but attend to token [1] the most, then [4], then [3], then [2]. Call this the order of head 1.
The head increases the logits corresponding to pairs containing the tokens it attends to, except for the pairs that contain tokens higher in the order. That is, when attending to token [1], increase the logits for outputs [0-3] (corresponding to the logits indicating that there’s a 1 present in the sequence) and decrease logits for outputs [4-9] (corresponding to all other logits). Similarly, when attending to 4, increase the logits for outputs [6], [8], and [9] (corresponding to logits indicating that there’s a 4 present but not a 1). When attending to 3, increase logits for outputs [5] and [7] (there’s a 3 but not a 1 or 4), and when attending to 2, increase logits for outputs [2], [3], [4]. In fact, the head writes larger logit increases for the tokens it attends to less, which partially cancels out the smaller attention weight those tokens receive.
So on the sequence [7] [4] [10] [3] [0], head 1 will increase the logits for [6], [8], [9] a lot and [5] and [7] a little, while suppressing all other logits.
Head 0 implements the same algorithm, but attends in order [2], [3], [4], [1] (the reverse of head 1).
That being said, it’s a lot less clean in terms of what it outputs; e.g. it slightly increases logits [7-9] if it sees a 1. This is probably for error correction/calibration reasons: increasing logits [7-9] helps cancel out head 1’s strong bias towards suppressing logits [5-9].
On the sequence [7] [4] [10] [3] [0], head 0 increases the logits for [2], [7], [8] a lot and [3] and [9] a little.
Adding together the two heads causes them to output the correct answer.
On the sequence [7] [4] [10] [3] [0], since both heads increase logit [8] a lot, and increase the other logits only a little, the model outputs [8] (corresponding to 3, 4 being in sequence).
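To make the “adding the two heads gives the right answer” step concrete, here’s a toy sketch of the idealized rule above (my caricature, not the trained weights: I use made-up write strengths of 2 for a head’s more-preferred present token and 1 for the other, and ignore deviations like head 1’s behaviour on token 2):

```python
# Toy check of the idealized two-head rule described above (my caricature, not the
# trained model): each head boosts the pairs containing an attended token, except
# pairs containing a token higher in that head's preference order, with a made-up
# write strength of 2 for its more-preferred present token and 1 for the other.
from itertools import combinations_with_replacement

PAIRS = list(combinations_with_replacement([1, 2, 3, 4], 2))   # label i <-> PAIRS[i]
ORDERS = {1: [1, 4, 3, 2],    # head 1 attends 1 > 4 > 3 > 2
          0: [2, 3, 4, 1]}    # head 0 attends in the reverse order

def head_logits(order, present):
    """Idealized logit contribution of one head, given the two signal tokens present."""
    logits = [0.0] * len(PAIRS)
    ranked = sorted(set(present), key=order.index)             # head's preference among present tokens
    for weight, tok in zip([2.0, 1.0], ranked):
        blocked = set(order[:order.index(tok)])                # tokens above tok in this head's order
        for i, pair in enumerate(PAIRS):
            if tok in pair and not blocked & set(pair):
                logits[i] += weight
    return logits

# Summing the two heads' contributions picks out the correct pair in every case.
for label, (a, b) in enumerate(PAIRS):
    total = [x + y for x, y in zip(head_logits(ORDERS[1], (a, b)),
                                   head_logits(ORDERS[0], (a, b)))]
    assert total.index(max(total)) == label
```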
You conclude that this is an example of a different kind of “attention head superposition”, because this task is implemented across two attention heads, even though it takes 10 skip trigrams to naively implement this task.
Questions/comments:
I’m not sure my understanding of the task is correct; does the description above seem right to you?
Assuming the description above is correct, it seems that there’s an easy algorithm for implementing this with one head.
When you see a token, increase the logits corresponding to pairs containing that token. Then, attend to all tokens in [1-4] uniformly.
You can explain this with skip-bigrams—the model needs to implement the 16 skip bigrams mapping each of 4 tokens to the 4 logits corresponding to a pair containing the token.
You need a slight correction to handle the case where the same token appears twice: you in fact want to increase the logits non-uniformly, so as to assign a slightly higher logit to the pair containing the attended-to token twice.
Though, if you trained the model to be uniform on all tokens except [0], it’ll need to check for [0] when deciding to output non-uniform logits and move this information from the other tokens, so it needs to stash its “bigrams” in EVOU and not EU.
It’s pretty easy to implement 16 skip-bigrams in a matrix of size 4 x 10 (you only need 16 non-zero entries out of 40 total entries). You want EVOU to look something like:
3 2 2 2 0 0 0 0 0 0
0 2 0 0 3 2 2 0 0 0
0 0 2 0 0 2 0 3 2 0
0 0 0 2 0 0 2 0 2 3
Then, with EQKE uniform on [1-4] and 0 otherwise, the output of the head (attention-weighted EVOU) in cases where there are two different signal tokens in the input will be (up to the overall factor of 1/2 from the attention normalization) 4 on the true logit, 2 or 3 on the logits for pairs containing one of the tokens but not the other, and 0 on the other logits. In cases where the same token appears twice, you get 6 on the true logit, 4 on the three other pairs containing the token once, and 0 otherwise.[2] You can then scale EVOU upwards to decrease loss until weight decay kicks in.
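Here’s a quick numerical check of this construction (my sketch; I ignore the overall scale and just check that the correct logit is strictly largest in every case):

```python
# Quick numerical check of the hand-built EVOU above (my sketch; overall scale and
# softmax temperature ignored): with attention split uniformly over the two signal
# tokens present, the correct pair logit is strictly largest in every case.
import numpy as np
from itertools import combinations_with_replacement

EVOU = np.array([[3, 2, 2, 2, 0, 0, 0, 0, 0, 0],    # row for token 1
                 [0, 2, 0, 0, 3, 2, 2, 0, 0, 0],    # row for token 2
                 [0, 0, 2, 0, 0, 2, 0, 3, 2, 0],    # row for token 3
                 [0, 0, 0, 2, 0, 0, 2, 0, 2, 3]],   # row for token 4
                dtype=float)

# Output label i corresponds to the i-th unordered pair of signal tokens.
pairs = list(combinations_with_replacement(range(1, 5), 2))

for label, (a, b) in enumerate(pairs):
    out = 0.5 * (EVOU[a - 1] + EVOU[b - 1])   # uniform attention over the two signal positions
    assert out.argmax() == label              # the true pair's logit wins every time
```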
In your case, you have two EVOUs of size 4 x 10, which are constrained to be rank 5 due to d_head = 5. This is part of why the model wants to split the computation evenly across both heads.
From eyeballing, adding together the two EVOUs indeed produces something akin to the above diagram.
Given that you split the computation, and that the rank-5 constraint on each head’s EVOU introduces non-zero bias/noise, you want the two heads to have opposite biases/noise terms such that they cancel out. This is why you see one head specializing in copying over 1, then 4, then 3, then 2, and the other 2, then 3, then 4, then 1.
This also explains your observation: “We were also surprised that this problem can be solved with one head, as long as d_head >= 4. Intuitively, once a head has enough dimensions to store every “interesting” token orthogonally, its OV circuit can simply learn to map each of these basis vectors to the corresponding completions.”
It makes sense that d_head >= 4 is required here, because you definitely cannot implement anything approaching the above EVOU with a rank-3 matrix (since you can’t even “tell apart” the 4 input tokens). Presumably the model can learn low-rank approximations of the above EVOU, though I don’t know how to construct them by hand.
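(As a tiny check of the rank claim, assuming the hand-built EVOU above; this says nothing about what approximate low-rank solutions the model might find instead:)

```python
# The hand-built EVOU above has full row rank 4, so a single head needs d_head >= 4
# to represent it exactly. (My sketch; it says nothing about approximate solutions.)
import numpy as np

EVOU = np.array([[3, 2, 2, 2, 0, 0, 0, 0, 0, 0],
                 [0, 2, 0, 0, 3, 2, 2, 0, 0, 0],
                 [0, 0, 2, 0, 0, 2, 0, 3, 2, 0],
                 [0, 0, 0, 2, 0, 0, 2, 0, 2, 3]], dtype=float)

print(np.linalg.matrix_rank(EVOU))               # 4
print(np.linalg.svd(EVOU, compute_uv=False))     # singular values ~ [5.74, 4.12, 4.12, 4.12]
```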
So it seems to me that, if my understanding is correct, this is also not an example of “true” superposition, in the sense I distinguish here: https://www.lesswrong.com/posts/8EyCQKuWo6swZpagS/superposition-is-not-just-neuron-polysemanticity
What exactly do you mean by superposition?
It feels like you’re using the term interchangeably with “polysemanticity” or “decomposability”. But part of the challenge of superposition is that there are more sparse “things” the model wants to compute or store than it has “dimensions”/“components”, which means there’s no linear transformation of the input space that recovers all the features. This is meaningfully distinct from the case where the model wants to represent one thing across multiple components/dimensions for error correction or other computational efficiency reasons (e.g. see example 1 here), which are generally easier to handle using linear algebra techniques.
It feels like you’re claiming superposition because there are more skip trigrams than n_heads; is there a different kind of superposition I’m missing here?
I think your example in the post is not an example of superposition in the traditional sense (again assuming that my interpretation is correct), and is in fact not even true polysemanticity. Instead of each head representing >1 feature, the low-rank nature of your heads means that each head basically has to represent 0.5 features.
The example in the post is an example of superposition of skip trigrams, but it’s pretty easy to construct toy examples of this kind -- would you consider any example where you can’t represent the task with <= n_heads skip trigrams to be an example of superposition?
Some nitpicks:
What is “nan” in the EVOU figure (in the chapter “OV circuit behaviour”)? I presume this is the (log-)sum(-exp) of the logits corresponding to outputs [9] and [10]?
It’s worth noting that (I’m pretty sure, though I haven’t sat down to write the proof) since softmax attention is a non-polynomial function of its inputs, 1-layer transformers with an unbounded number of heads can implement arbitrary functions of the inputs. On the other hand, skip n-grams for any fixed n obviously are not universal (e.g. they can’t implement XOR, as in the example in the “1-layer transformers =/= skip trigrams” post). So even theoretically (without constructing any examples), it seems unlikely that you should think of 1L transformers as only skip trigrams, though whether or not this occurs often in real networks is an empirical question (to which I’m pretty sure the answer is yes, because e.g. copy suppression heads are a common motif).
Scare quotes are here because their example is really disanalogous to MLP superposition. I.e., as they point out in their second post, their task is well thought of as naturally decomposing into two attention heads; and a model that has n >= 2 heads isn’t really “placing circuits in superposition” so much as doing a natural task decomposition that they didn’t think of.
In fact, it feels like that result is a cautionary tale: just because a model implements an algorithm in a non-basis-aligned manner does not mean the model is implementing an approximate algorithm that requires exploiting near-orthogonality in high-dimensional space (the traditional kind of residual stream/MLP activation superposition), nor does it mean that the model is “implementing more circuits than is feasible” (i.e. the sense that they try to construct in the May 2023 update). You might just not understand the algorithm the model is implementing!
If I were to speculate more, it seems like they were screwed over by continuing to think about one-layer attention models as sets of skip trigrams, which they are not. More poetically: if your “natural” basis isn’t natural, then of course your model won’t use your “natural” basis.
Note that this construction isn’t optimal, in part because output tokens corresponding to the same token occurring twice occur half as often as those with two different tokens, while this construction gets lower log loss in the repeated-token case than in the two-distinct-token case. But the qualitative analysis carries through regardless.
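(A quick sanity check of that loss comparison, using the attention-weighted outputs of the hand-built EVOU sketch above at an arbitrary overall scale:)

```python
# Sanity check (same assumptions as the hand-built EVOU sketch above): the head's
# attention-weighted output logits, with the true logit listed first, at an
# arbitrary overall scale beta.
import numpy as np

beta = 4.0                                                       # arbitrary logit scale
distinct = beta * np.array([2, 1.5, 1.5, 1, 1, 1, 1, 0, 0, 0])   # two distinct signal tokens
repeated = beta * np.array([3, 2, 2, 2, 0, 0, 0, 0, 0, 0])       # same signal token twice

def xent(logits, true_idx=0):
    """Cross-entropy of the softmax distribution against the true label."""
    return float(np.log(np.exp(logits - logits[true_idx]).sum()))

print("two-distinct-token loss:", xent(distinct))   # the larger of the two
print("repeated-token loss:   ", xent(repeated))    # the smaller of the two
```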
Hey, sorry for the (very) belated response—thanks for the comment! Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you’re right that it was pretty sloppy/confused with the concept of “superposition”. I’ve since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for “true” superposition.
Re: your point about there existing simpler solutions—you’re totally right that for d_head >= 4 there exists a more straightforward n_head = 1 solution. I did try solving this problem on paper before training anything and arrived at the same thing as you.
However, we found that for d_head = 1, n_head = 2 the model could still solve the problem perfectly—in this case I think the problem is less trivial, and it does rely on the kind of “conditional attention hierarchy” behaviour and the associated interference we talk about. When n_head = 2 and d_head >= 4 the model still prefers this approach over the more trivial method you outline. We included the plots from this experiment over the n_head = 2, d_head = 1 version because the plots were a bit easier to read and we felt they made the same point, but in retrospect that probably wasn’t the right call.
Overall I’m a lot less impressed/interested by this work in retrospect, largely for the reasons you point out here. However, I think some of the qualitative behaviours we saw are still quite interesting, and they have (at least for me) affected how I think about what kinds of things attention layers might be doing (although the lessons may not be new/interesting to others):
“Inverted attention preferences”: In almost all of our tests, the two heads learn to invert the order in which they attend to important tokens. If there are multiple important key-tokens that all need to be attended to, you really don’t want multiple heads attending to the same token and ignoring others, so the QK-circuits of heads may be arranged so they distribute responsibility in a mutually exclusive/exhaustive way. Obviously our toy example is an extreme case, but I think this mutual information between QK-circuits is likely to exist in LLMs, since “needing to attend to a lot of different context information simultaneously” is very much present in language.
“Thinking of heads as copying information about entire contexts vs. specific tokens”: This is maybe more of a perspective-shift than anything, but I found it interesting that when a head attended to its “second favourite token”, it could safely not write to the logits of the completion implied by (second-favourite token, first-favourite token), because it can “infer” that the first favourite is not elsewhere in the context (or else it’d be attending there). In other words, when an OV-circuit is applied at a specific key-position, it’s able to exploit not just the information in the residual stream locally at that position, but also the information implied about the entire context by its QK-circuit. Again, this may largely just be a “frame-shift” thing, but it’s definitely informed how I think about the relationship between the QK- and OV-circuits, and how independent/disconnected I should think of them as being.