[Proposal] Attention Transcoders: can we take attention heads out of superposition?
Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here.
Primer: Attention-Head Superposition
Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.
Definition 1: OV-incoherence. An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from.
Example 2: Skip-trigram circuits. A skip trigram consists of a sequence [A]...[B] → [C], where A, B, and C are distinct tokens: the model sees [A] somewhere earlier in context, then [B], and should predict [C]. A set of skip trigrams that share the same [A] but whose outputs [C] depend on [B] is OV-incoherent: attention goes from [B] back to [A], yet the output must vary with the token attended from.
Claim 3: A single head cannot implement multiple OV-incoherent circuits. Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output, and it sees only the key token (the token attended to). Since it does not see the query token, it must compute a fixed function of the key token, so its output cannot vary with the token attended from, which is exactly what OV-incoherence requires.
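To spell that out, in the standard notation from A Mathematical Framework (with $x_s$ the residual-stream vector at source position $s$ and $A$ the attention pattern), a head's output at destination position $d$ is

$$h(x)_d = \sum_{s \le d} A_{d,s}\, W_O W_V\, x_s.$$

The attention pattern $A$ (the QK circuit) can depend on the query token, but the vector written for a given source token, $W_O W_V x_s$, cannot: if the head attends entirely to a single token $s^*$, its output is $W_O W_V x_{s^*}$ no matter which token it attended from.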
Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads.
Attention Transcoders
An attention transcoder (ATC) is described as follows:
An ATC takes the input of a specific attention block and attempts to reconstruct that block's output, i.e. it approximates the block's input-output map.
An ATC is simply a standard multi-head attention module, except that it has many more attention heads.
An ATC is regularised during training such that only a small number of its heads is active on any given token.
I’ve left the exact form of this sparsity regulariser intentionally vague at the moment, as I’m uncertain how exactly to do it; one possible option is sketched below.
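To make this slightly more concrete, here is a minimal PyTorch sketch of one way the architecture and training objective could look. All names here (AttentionTranscoder, atc_loss, sparsity_coeff) are mine, and the L1 penalty on per-head output norms is only one stand-in for the unspecified sparsity regulariser, not a settled choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionTranscoder(nn.Module):
    """A standard multi-head attention module with many more heads than the
    original block, trained to reconstruct that block's input -> output map."""

    def __init__(self, d_model: int, n_heads: int, d_head: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.W_Q = nn.Parameter(torch.randn(n_heads, d_model, d_head) * d_model**-0.5)
        self.W_K = nn.Parameter(torch.randn(n_heads, d_model, d_head) * d_model**-0.5)
        self.W_V = nn.Parameter(torch.randn(n_heads, d_model, d_head) * d_model**-0.5)
        self.W_O = nn.Parameter(torch.randn(n_heads, d_head, d_model) * d_head**-0.5)

    def forward(self, resid: torch.Tensor):
        # resid: [batch, seq, d_model], the input to the original attention block.
        q = torch.einsum("bsd,hde->bhse", resid, self.W_Q)
        k = torch.einsum("bsd,hde->bhse", resid, self.W_K)
        v = torch.einsum("bsd,hde->bhse", resid, self.W_V)
        scores = torch.einsum("bhqe,bhke->bhqk", q, k) / self.d_head**0.5
        seq = resid.shape[1]
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=resid.device), 1)
        pattern = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        z = torch.einsum("bhqk,bhke->bhqe", pattern, v)
        head_out = torch.einsum("bhqe,hed->bhqd", z, self.W_O)  # per-head outputs
        return head_out.sum(dim=1), head_out  # reconstruction [b,s,d], heads [b,h,s,d]

def atc_loss(atc, resid_in, attn_out, sparsity_coeff=1e-3):
    """Reconstruction loss plus an L1-style penalty on per-head output norms,
    one possible stand-in for the 'few active heads per token' regulariser."""
    recon, head_out = atc(resid_in)
    recon_loss = F.mse_loss(recon, attn_out)
    sparsity = head_out.norm(dim=-1).mean(dim=(0, 2)).sum()  # sum over heads
    return recon_loss + sparsity_coeff * sparsity
```

Training data would be (block input, block output) activation pairs collected from the model, exactly as for MLP transcoders.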
Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks.
Residual-stream SAEs simulate a model that has many more residual neurons.
MLP transcoders simulate a model that has many more hidden neurons in its MLP.
ATCs simulate a model that has many more attention heads.
Remark 6: Intervening on ATC heads. Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model’s computational graph and intervening directly on individual head outputs.
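For concreteness, here is a minimal sketch of such a splice-and-ablate intervention using plain PyTorch forward hooks. The function name, the assumption that the attention module receives its residual-stream input as its first positional argument, and the AttentionTranscoder class (from the sketch above) are all mine; a real implementation would probably use a hooked-model library such as TransformerLens instead.

```python
import torch

@torch.no_grad()
def run_with_atc(model, attn_module, atc, tokens, ablate_head=None):
    """Run `model` with `attn_module`'s output replaced by the ATC reconstruction,
    optionally zero-ablating one ATC head. Assumes the attention module gets its
    residual-stream input as its first positional argument (varies by codebase)."""

    def hook(module, inputs, output):
        resid_in = inputs[0]                          # the block input the ATC reads
        recon, head_out = atc(resid_in)
        if ablate_head is not None:
            recon = recon - head_out[:, ablate_head]  # subtract one head's contribution
        # Returning a value from a forward hook replaces the module's output.
        return recon if not isinstance(output, tuple) else (recon,) + tuple(output[1:])

    handle = attn_module.register_forward_hook(hook)
    try:
        return model(tokens)
    finally:
        handle.remove()
```

Comparing the model's loss or logits with and without ablating a given ATC head is then a direct test of specific predictions about what that head is doing.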
Remark 7: Attributing ATC heads to ground-truth heads. In standard attention-out SAEs, it’s possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working.
Key uncertainties
Does AHS actually occur in language models? I think we do not have crisp examples at the moment.
Concrete experiments
The first and most obvious experiment is to try training an ATC and see if it works.
Scaling milestones: toy models, TinyStories, OpenWebText
Do we achieve better Pareto curves of reconstruction loss versus L0 than standard attention-out SAEs? (A sketch of this evaluation follows the list.)
Conditional on that succeeding, the next step would be to examine individual ATC heads and determine whether they are interpretable.
It may be useful to compare to known examples of suspected AHS; however, direct comparison is difficult due to Remark 7 above.
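For the Pareto-curve comparison, the relevant per-token metrics are just reconstruction error and the number of active heads (L0). Here is a rough sketch of how I'd measure these for a trained ATC, again assuming the AttentionTranscoder above; the norm threshold that defines an 'active' head is an arbitrary choice of mine.

```python
import torch

@torch.no_grad()
def evaluate_atc(atc, batches, active_threshold=1e-2):
    """Mean per-token reconstruction MSE and mean number of 'active' ATC heads per
    token, over (block input, block output) activation pairs from the model."""
    total_mse, total_l0, n_tokens = 0.0, 0.0, 0
    for resid_in, attn_out in batches:
        recon, head_out = atc(resid_in)                     # head_out: [b, h, s, d]
        total_mse += (recon - attn_out).pow(2).mean(-1).sum().item()
        active = head_out.norm(dim=-1) > active_threshold   # [b, h, s]
        total_l0 += active.sum().item()
        n_tokens += resid_in.shape[0] * resid_in.shape[1]
    return total_mse / n_tokens, total_l0 / n_tokens
```

Sweeping the sparsity coefficient then traces out the Pareto frontier, which can be compared against attention-out SAEs at matched L0.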