Hey, sorry for the (very) belated response—thanks for the comment! Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you’re right that it was being pretty sloppy/confused with the concept of “superposition”. I’ve since come around more to your perspective that polysemanticity/distributed representation/interference is insufficient for “true” superposition.
Re: your point about there existing simpler solutions—you’re totally right that for d_head >= 4 there exists a more straightforward n_head = 1 solution. I did try solving this problem on paper before training anything and arrived at the same thing as you.
However, we found that for d_head = 1, n_head = 2 the model could still solve the problem perfectly—in this case I think the problem is less trivial, and it does rely on the kind of “conditional attention hierarchy” behaviour and the associated interference we talk about. When n_head = 2 and d_head >= 4, the model still prefers this approach over the more trivial method you outline. We included the plots from that experiment rather than the n_head = 2, d_head = 1 version because they were a bit easier to read and we felt they made the same point, but in retrospect the d_head = 1 plots would probably have made the non-trivial case clearer.
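(For anyone who wants to poke at this themselves, here is roughly the shape of the toy model in those n_head/d_head configurations. The vocab size, d_model, sequence length, masking, and training code are placeholder simplifications for illustration, not the exact setup from the post.)

```python
# Rough sketch of an attention-only toy model parameterised by n_head/d_head.
# Vocab size, d_model, seq_len, and init scale are illustrative placeholders.
import torch
import torch.nn as nn

class TinyAttnModel(nn.Module):
    def __init__(self, vocab_size=8, d_model=16, n_head=2, d_head=1, seq_len=8):
        super().__init__()
        self.n_head, self.d_head = n_head, d_head
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))
        # Per-head projections into a d_head-dimensional subspace, then back
        # out to the residual stream via W_O.
        self.W_Q = nn.Parameter(torch.randn(n_head, d_model, d_head) * 0.02)
        self.W_K = nn.Parameter(torch.randn(n_head, d_model, d_head) * 0.02)
        self.W_V = nn.Parameter(torch.randn(n_head, d_model, d_head) * 0.02)
        self.W_O = nn.Parameter(torch.randn(n_head, d_head, d_model) * 0.02)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):                        # tokens: (batch, seq)
        x = self.embed(tokens) + self.pos[: tokens.shape[1]]
        q = torch.einsum("bsd,hde->bhse", x, self.W_Q)
        k = torch.einsum("bsd,hde->bhse", x, self.W_K)
        v = torch.einsum("bsd,hde->bhse", x, self.W_V)
        scores = torch.einsum("bhqe,bhke->bhqk", q, k) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)                 # (batch, head, query, key)
        z = torch.einsum("bhqk,bhke->bhqe", attn, v)
        x = x + torch.einsum("bhqe,hed->bqd", z, self.W_O)
        return self.unembed(x), attn                  # logits + attention patterns
```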
Overall I’m a lot less impressed by/interested in this work in retrospect, largely for the reasons you point out here. However, I think some of the qualitative behaviours we saw are still quite interesting, and, at least for me, they’ve affected how I think about what kinds of things attention layers might be doing (although the lessons may not be new/interesting to others):
“Inverted attention preferences”: In almost all of our tests, the two heads learn to invert the order in which they attend to important tokens. If there are multiple important key-tokens that all need to be attended to, you really don’t want multiple heads attending to the same token while ignoring others, so the QK-circuits of the heads may be arranged to distribute responsibility in a mutually exclusive/exhaustive way. Obviously our toy example is an extreme case, but I think this kind of mutual dependence between heads’ QK-circuits is likely to exist in LLMs, since “needing to attend to a lot of different context information simultaneously” is very much present in language.
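(A toy illustration of what I mean by heads splitting responsibility rather than colliding; the numbers below are made up for the example, not outputs from the trained model.)

```python
# Made-up attention weights for two heads at the same query position, just to
# show the "inverted preferences" / mutually-exclusive-coverage idea.
import numpy as np

head_0 = np.array([0.70, 0.20, 0.05, 0.05])   # favours key 0, then key 1
head_1 = np.array([0.15, 0.75, 0.05, 0.05])   # favours key 1, then key 0

order_0 = np.argsort(-head_0)[:2]             # head 0's top two keys: [0, 1]
order_1 = np.argsort(-head_1)[:2]             # head 1's top two keys: [1, 0]
print("inverted preferences:", list(order_0) == list(order_1[::-1]))

important_keys = {0, 1}                       # keys that all need covering
top_picks = {int(head_0.argmax()), int(head_1.argmax())}
print("important keys covered:", important_keys <= top_picks)
```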
“Thinking of heads as copying information about entire contexts vs. specific tokens”: This is maybe more of a perspective-shift than anything, but I found it interesting that when a head attended to its “second-favourite token”, it could safely not write to the logits of the completion implied by (second-favourite token, first-favourite token), because it can “infer” that the first-favourite is not elsewhere in the context (or else it’d be attending there). In other words, when a head’s OV-circuit is applied at a specific key-position, it can exploit not just the information in the residual stream at that position, but also the information its QK-circuit implies about the entire context. Again, this may largely be a “frame-shift” thing, but it’s definitely informed how I think about the relationship between the QK- and OV-circuits and how independent/disconnected I should think of them as being.
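(To make the QK-/OV-circuit framing concrete, here is the standard circuit-style decomposition for a single head, written against the placeholder TinyAttnModel sketched earlier; positional terms are ignored for simplicity.)

```python
# QK-circuit: which key-token a given query-token wants to attend to.
# OV-circuit: what the head writes to the output logits when it attends there.
# Uses the placeholder TinyAttnModel from the sketch above.
import torch

model = TinyAttnModel(n_head=2, d_head=1)
W_E = model.embed.weight            # (vocab, d_model)
W_U = model.unembed.weight.T        # (d_model, vocab)

head = 0
with torch.no_grad():
    # (vocab, vocab): attention score query-token i gives to key-token j
    qk_circuit = W_E @ model.W_Q[head] @ model.W_K[head].T @ W_E.T
    # (vocab, vocab): logit contribution to each output token from attending to token j
    ov_circuit = W_E @ model.W_V[head] @ model.W_O[head] @ W_U

print(qk_circuit.shape, ov_circuit.shape)   # both (vocab, vocab)
```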