I like this method, and I can see that it eliminates this kind of superposition. You already address the limitation that these gated attention head blocks do not eliminate other forms of attention head superposition, and I agree. It feels specifically designed to deal with the kind of superposition that occurs for skip trigrams, and I would be interested to see how well it generalizes to superposition in the wild.
I tried to come up with a list of forms of attention head superposition that cannot be disentangled by gated attention blocks:
1. Multiple attention heads perform a distributed computation that attends to different source tokens.
You already address this, and Greenspan and Wynroe give an example.
2. The superposition is across attention heads in different layers.
This case is not caught because the sparsity penalty is only applied to attention heads within the same layer. Why should there be superposition of attention heads between layers? As a toy model, imagine a 2-layer attention-only transformer with n_head heads in each layer, given a dataset with more than n_head^2 + n_head skip trigrams to learn. Such a transformer could use the computation in superposition described in Figure 1 to correctly model all the skip trigrams, but it would run out of attention head pairs within the same layer to distribute the computation between (see the counting sketch after this list). It would then have to resort to putting attention head pairs across layers into superposition.
3. Overlapping necessary superposition.
Let's say there is some computation for which you need two attention heads attending to the same token position. The simplest example of a situation where this is necessary is when you want to copy information from a source token that is “bigger” than the head dimension; the transformer can then use two heads to copy over twice as much information. Now imagine there are three cases A, B, C in which information has to be copied from the source token, we have three heads 1, 2, 3, and the information to be copied fits in 2*d_head dimensions. Is there a way to solve this task? Yes: heads 1 & 2 work in superposition to copy the information in case A, 2 & 3 in case B, and 3 & 1 in case C. In theory, we could make all attention heads monosemantic by having a set of six attention heads trained to perform the same computation: A: 1 & 2, B: 3 & 4, C: 5 & 6. But the way the L_0.6 norm is applied, it only tries to reduce the number of times that two attention heads attend to the same token, and that count is the same for both ways of arranging the computation (see the second sketch below).
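To make the counting in point 2 concrete, here is a minimal sketch (my own illustration, not code or analysis from the post) of one way to arrive at the n_head^2 + n_head figure, assuming each skip trigram is handled either by a single head or by a pair of heads in the same layer; the helper name within_layer_capacity is hypothetical.

```python
# Hypothetical counting sketch: how many skip trigrams a 2-layer attention-only
# transformer can cover if each one is implemented either by a single head or
# by a pair of heads confined to the same layer (the Figure 1 style of
# computation in superposition).
from math import comb

def within_layer_capacity(n_head: int, n_layers: int = 2) -> int:
    """Single heads plus within-layer head pairs available for skip trigrams."""
    single_heads = n_layers * n_head               # one skip trigram per lone head
    same_layer_pairs = n_layers * comb(n_head, 2)  # one skip trigram per within-layer pair
    return single_heads + same_layer_pairs

for n_head in (2, 4, 8):
    print(n_head, within_layer_capacity(n_head), n_head**2 + n_head)
    # e.g. n_head=8 -> 72 = 8^2 + 8
```

Under these assumptions, once the dataset contains more than n_head^2 + n_head skip trigrams, the remaining ones would have to be spread across head pairs in different layers, which the within-layer sparsity penalty never sees.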
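And here is an equally minimal sketch of the point in 3: the penalty only counts how often two heads attend to the same source token, and that count is identical for the superposed and the monosemantic arrangement. The function co_attention_events and the head assignments are again my own illustration, not the post's code.

```python
from itertools import combinations

def co_attention_events(assignment: dict) -> int:
    """Count (case, head-pair) events in which two heads attend to the same token."""
    return sum(len(list(combinations(heads, 2))) for heads in assignment.values())

# Heads shared between cases: each head participates in two computations.
superposed   = {"A": (1, 2), "B": (2, 3), "C": (3, 1)}
# Dedicated heads: each head participates in exactly one computation.
monosemantic = {"A": (1, 2), "B": (3, 4), "C": (5, 6)}

print(co_attention_events(superposed), co_attention_events(monosemantic))  # 3 3
```

Since the count is the same in both cases, the penalty provides no pressure toward the monosemantic solution.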
Thank you for the comment! Yep, that is correct. I think variants of this approach could perhaps still be useful for resolving other forms of superposition within a single attention layer, but not currently across different layers.