Thanks for writing this up. It seems like a valuable contribution to our understanding of one-layer transformers. I particularly like your toy example – it’s a good demonstration of how more complicated behavior can occur here.
For what it’s worth, I understand this behavior as competition between skip-trigrams. We introduced “skip-trigrams” as a way to think about pairs of entries in the OV and QK-circuit matrices. The QK-circuit entry describes how strongly the attention head wants to attend to a given token going into the attention softmax, and so how strongly it implements a particular skip-trigram. The phenomenon you describe occurs when multiple skip-trigrams are present with different QK-circuit values.
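To make the competition concrete, here is a minimal sketch with made-up numbers (the token names, `qk_score`, `ov_logit`, and `head_output_logit` are all hypothetical and not taken from any real model). It treats a single attention head as a softmax over QK scores followed by a weighted sum of OV contributions to one fixed output token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output_logit(source_tokens, qk_score, ov_logit):
    """Contribution of one attention head to the logit of a fixed output token.

    qk_score[t]: pre-softmax attention score from the current token to source token t.
    ov_logit[t]: how much attending to t raises the output token's logit.
    """
    attn = softmax([qk_score[t] for t in source_tokens])
    return sum(a * ov_logit[t] for a, t in zip(attn, source_tokens))

# Hypothetical values: "A" carries the skip-trigram we care about,
# "B" has a higher QK score but contributes nothing to this output token.
qk_score = {"filler": 0.0, "A": 2.0, "B": 6.0}
ov_logit = {"filler": 0.0, "A": 5.0, "B": 0.0}

with_a_only = head_output_logit(["filler", "A"], qk_score, ov_logit)
with_a_and_b = head_output_logit(["filler", "A", "B"], qk_score, ov_logit)

print(f"A alone:   {with_a_only:.3f}")
print(f"A with B:  {with_a_and_b:.3f}")
```

In this toy setup the skip-trigram through “A” is unchanged in both the OV and QK entries, yet its effect on the output nearly vanishes once “B” is in the context, because “B” takes most of the attention in the softmax.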
An analogy I find useful for thinking about this is protein binding affinity in molecular biology. (I don’t know much about molecular biology – hopefully experts can forgive me if my analogy is naive!) Proteins have a propensity to bind to other proteins, just as attention heads have a propensity to attend between specific tokens and implement skip-trigrams. However, fully understanding the behavior requires remembering that when one protein has a higher binding affinity than another, it can “block” the other from binding. This doesn’t mean that it’s incorrect to understand proteins as having binding affinity! Nor does it mean that skip-trigrams are the wrong way to understand one-layer models. It just means that in thinking about proteins (or skip-trigrams) one wants to keep in mind the possibility of second-order interactions.
I do think your example is very clarifying about the kind of second-order interactions that can occur with skip-trigrams! While I definitely knew that “skip-trigrams compete for attention”, I hadn’t realized it could give rise to this behavior.
With that said, I get the sense that you may have understood us to be making a stronger claim about skip-trigrams being independent than we intended. I’m sorry for any confusion here. We do talk about “independent skip-trigram models”. Here “independent” is modifying “models” – it’s referring to the fact that there are multiple attention heads, each implementing its own independent skip-trigram model. (This might seem trivial now, but we spent an entire section on this point because many people didn’t realize it from the original concatenated version of the transformer equations.) Then “skip-trigram” is referring to the fact that the natural units of one-layer models are triplets of tokens. Although our introduction and section introduce the term without more context, our actual discussion of skip-trigrams keeps referring back to the OV and QK-circuits, which are the mathematical model that skip-trigrams are meant to provide a language for talking about.
I’ve been meaning to add a number of correctives and clarifications to our papers – this is on the list, and we’ll link to your example!
(I’ll comment on your more general thesis regarding understanding models with respect to a specific distribution in a separate comment.)