Collection of some mech interp knowledge about transformers:
Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I’ve just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.
Transformers take in a sequence of tokens, and return logprob predictions for the next token. We think they work like this:
Activations represent a sum of feature directions, each direction corresponding to a semantic concept. The magnitude of a direction corresponds to the strength or importance of the concept.
These features may be 1-dimensional, but multi-dimensional features may make sense too. We can either allow for multi-dimensional features (e.g. a circle of days of the week), acknowledge that the relative directions of feature embeddings matter (e.g. treating the days of the week as individual features whose directions span a circle), or both. See also Jake Mendel’s post.
The concepts may be “linearly” encoded, in the sense that two concepts A and B being present (say with strengths α and β) are represented as α*vector_A + β*vector_B. This is the key assumption of the linear representation hypothesis. See Chris Olah & Adam Jermyn but also Lewis Smith.
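As a toy illustration of this additivity (all vectors here are random stand-ins, not real model directions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy residual-stream width

# Random unit vectors standing in for the directions of concepts A and B.
vector_A = rng.normal(size=d)
vector_A /= np.linalg.norm(vector_A)
vector_B = rng.normal(size=d)
vector_B /= np.linalg.norm(vector_B)

# Concepts present with strengths alpha and beta are summed linearly.
alpha, beta = 2.0, 0.5
activation = alpha * vector_A + beta * vector_B

# If the directions are near-orthogonal, a dot product (a linear probe)
# approximately recovers each concept's strength.
est_alpha = activation @ vector_A  # ~ alpha, up to interference from B
est_beta = activation @ vector_B   # ~ beta, up to interference from A
```

The recovery is only approximate: the interference term is α·β times the overlap of the two directions, which is why near-orthogonality matters.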
The residual stream of a transformer stores information the model needs later. Attention and MLP layers read from and write to this residual stream. Think of it as a kind of “shared memory”, with this picture in your head, from Anthropic’s famous AMFTC.
This residual stream seems to slowly accumulate information throughout the forward pass, as suggested by LogitLens.
Additionally, we expect there to be internally-relevant information inside the residual stream, such as whether the sequence of nouns in a sentence is ABBA or BABA.
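The LogitLens idea can be sketched in a few lines (everything here is a random stand-in for illustration; real LogitLens uses the model's actual final layer norm and unembedding matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50  # toy sizes

W_U = rng.normal(size=(d_model, d_vocab))  # stand-in unembedding matrix

def logit_lens(resid_mid):
    """Decode an intermediate residual-stream vector as if it were the final one.
    Real LogitLens applies the model's final layer norm before unembedding;
    a crude normalization stands in for it here."""
    x = resid_mid - resid_mid.mean()
    x = x / np.sqrt((x ** 2).mean() + 1e-5)
    return x @ W_U  # logits over the vocabulary

resid_after_layer_k = rng.normal(size=d_model)  # stand-in mid-model activation
logits = logit_lens(resid_after_layer_k)
```

Applying this at every layer and watching the implied next-token prediction sharpen is what suggests the gradual accumulation of information.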
Maybe think of each transformer block / layer as doing a serial step of computation. Though note that layers don’t need to be privileged points between computational steps, a computation can be spread out over layers (Lee Sharkey’s CLDR arguments, Anthropic’s Crosscoder-motivation image)
Superposition. There can be more features than dimensions in the vector space, corresponding to almost-orthogonal directions. Established in Anthropic’s TMS. A model can also use a mix of dedicated and superposed representations. See Chris Olah’s post on distributed representations for a nice write-up.
Superposition requires sparsity, i.e. that only a few features are active at a time.
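A quick numerical check of the “almost-orthogonal directions” claim, using random directions rather than learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 512, 64  # 8x more features than dimensions

# Random unit directions in d dims are nearly orthogonal to each other.
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

gram = W @ W.T
np.fill_diagonal(gram, 0.0)
max_interference = np.abs(gram).max()  # worst-case pairwise overlap, well below 1

# Sparsity is what makes this workable: with few active features, a naive
# linear readout recovers each strength plus only a little interference noise.
acts = np.zeros(n_features)
acts[3], acts[77] = 1.0, 2.0           # only two features active
x = acts @ W                           # superposed activation vector
readout = W @ x                        # readout of every feature's strength
```

With many features active at once, the interference terms would add up and swamp the signal, which is the sparsity requirement in action.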
The model starts with token (and positional) embeddings.
We think token embeddings mostly store features that might be relevant about a given token (e.g. words in which it occurs and what concepts they represent). The meaning of a token depends a lot on context.
We think positional embeddings are pretty simple (in GPT2-small, but likely also other models). In GPT2-small they appear to encode ~4 dimensions worth of positional information, consisting of “is this the first token”, “how late in the sequence is it”, plus two sinusoidal directions. The latter three create a helix.
PS: If you try to train an SAE on the full embedding you’ll find this helix split up into segments (“buckets”) as individual features (e.g. here). Watch out for this bucketing as a sign of a compositional representation.
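The described positional structure can be sketched like this (the wavelength and scaling are made up for illustration; the real values would have to be read off GPT2-small):

```python
import numpy as np

n_ctx = 1024
pos = np.arange(n_ctx)
period = 220.0  # hypothetical wavelength, purely illustrative

pos_features = np.stack([
    (pos == 0).astype(float),           # "is this the first token?"
    pos / n_ctx,                        # "how late in the sequence is it?"
    np.sin(2 * np.pi * pos / period),   # two sinusoidal directions which,
    np.cos(2 * np.pi * pos / period),   # combined with the previous one, trace a helix
], axis=1)
```

The sin/cos pair sweeps out a circle as position increases, and the "how late" direction stretches that circle into a helix.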
The overall Transformer computation is said to start with detokenization: accumulating context and converting the pure token representation into a context-aware representation of the meaning of the text. Early layers in models often behave differently from the rest. Lad et al. claim three further distinct stages, but that’s not consensus.
There are a couple of common motifs we see in LLM internals, such as:
LLMs implementing human-interpretable algorithms.
Induction heads (paper, good illustration): attention heads being used to repeat sequences seen previously in context. This ranges from literally repeating text to maybe being generally responsible for in-context learning.
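The algorithm an induction head implements can be written down directly (a toy token-level sketch of the behaviour, not how the head literally computes it with attention):

```python
def induction_prediction(tokens):
    """If the current (last) token occurred earlier in context, predict the
    token that followed its most recent earlier occurrence; else None."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards through context
        if tokens[i] == current:
            return tokens[i + 1]
    return None
```

For example, on the sequence `["A", "B", "C", "A"]` this predicts `"B"`, mirroring the [A][B] ... [A] → [B] pattern induction heads are known for.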
Indirect object identification, docstring completion. Importantly, don’t take these early circuits works to mean “we actually found the circuit in the model”; rather, take away “here is a way you could implement this algorithm in a transformer”, and maybe the real implementation looks something like it.
In general we don’t think this manual analysis scales to big models (see e.g. Tom Lieberum’s paper).
Also we want to automate the process, e.g. ACDC and follow-ups (1, 2).
My personal take is that all circuits analysis is currently not promising because circuits are not crisp. By this I mean the observation that a few distinct components don’t seem to be sufficient to explain a behaviour: you need to add more and more components, slowly explaining more and more performance. This clearly points towards us not using the right units to decompose the model. Thus, model decomposition is the major area of mech interp research right now.
Moving information. Information is moved around in the residual stream, from one token position to another. This is what we see in typical residual stream patching experiments, e.g. here.
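The mechanics of a residual-stream patching experiment can be sketched with a toy two-block “model” (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

def block(resid, W):
    return resid + np.tanh(resid @ W)   # each block writes into the stream

def forward(x, patch_mid=None):
    resid = block(x, W1)
    if patch_mid is not None:           # the patching intervention:
        resid = patch_mid               # overwrite the mid-run residual stream
    return block(resid, W2)

x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)

resid_clean_mid = block(x_clean, W1)    # cache activations from the clean run
out_patched = forward(x_corrupt, patch_mid=resid_clean_mid)
```

Because the patch here overwrites the entire stream, the patched run reproduces the clean output exactly; real experiments patch only particular positions or directions and measure how much of the clean behaviour is restored.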
Information storage. Early work (e.g. Mor Geva) suggests that MLPs can store information as key-value memories; generally folk wisdom is that MLPs store facts. However, those facts seem to be distributed and non-trivial to localise (see ROME & follow-ups, e.g. MEMIT). The DeepMind mech interp team tried and wasn’t super happy with their results.
Logical gates. We think models calculate new features from existing features by computing e.g. AND and OR gates. Here we show a bunch of features that look like that is happening, and the papers by Hoagy Cunningham & Sam Marks show computational graphs for some example features.
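For intuition, a single ReLU neuron suffices for exact AND and OR on {0, 1}-valued features (real models presumably compute noisier, superposed versions of this):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def and_gate(a, b):
    # fires only when both inputs are on: a + b - 1 > 0 iff a = b = 1
    return relu(a + b - 1.0)

def or_gate(a, b):
    # inclusion-exclusion with ReLUs; exact on {0, 1} inputs
    return relu(a) + relu(b) - relu(a + b - 1.0)
```

Stacking such gates over layers is one concrete way models could compute new features from existing ones.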
Activation size & layer norm. GPT2-style transformers have a layer normalization layer before every Attn and MLP block. Also, the norm of activations grows throughout the forward pass. Combined, this means old features become less important over time; Alex Turner has thoughts on this.
There are hypotheses on what layer norm could be responsible for, but it can’t be doing anything essential, since you can run models without it (e.g. TinyModel, GPT2_noLN).
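The “old features fade” effect can be seen in a toy calculation (random vectors and made-up scales, for illustration only):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """RMS-style layer norm without learned scale/bias, for simplicity."""
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

rng = np.random.default_rng(0)
d = 64
feature = rng.normal(size=d)  # a fixed-magnitude feature written at an early layer

share = []
for resid_scale in [1.0, 10.0, 100.0]:  # residual norm grows over the forward pass
    background = rng.normal(size=d) * resid_scale / np.sqrt(d)
    resid = background + feature
    # the feature's share of the post-LN activation shrinks as the stream grows
    share.append(layer_norm(resid) @ feature / np.linalg.norm(feature))
```

The feature itself never changes, but once the rest of the stream is much larger, layer norm scales everything down together and the feature's contribution to downstream reads shrinks.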
(Sparse) circuits agenda. The current mainstream agenda in mech interp (see e.g. Chris Olah’s recent talk) is to (1) find the right components to decompose model activations, to (2) understand the interactions between these features, and to finally (3) understand the full model.
The first big open problem is how to do this decomposition correctly. There’s plenty of evidence that the current Sparse Autoencoders (SAEs) don’t give us the correct solution, as well as conceptual issues. I’ll not go into the details here to keep this short-ish.
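For reference, the basic SAE setup looks like this (toy sizes and random, untrained weights; real SAEs are far wider and trained):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 256  # dictionary much wider than the activation space

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae(x):
    acts = np.maximum(x @ W_enc + b_enc, 0.0)  # sparse, non-negative feature activations
    recon = acts @ W_dec + b_dec               # sum of active feature directions
    return acts, recon

# Training minimizes ||x - recon||^2 + lam * ||acts||_1,
# trading reconstruction fidelity against sparsity.
x = rng.normal(size=d_model)
acts, recon = sae(x)
```

The hope is that the rows of W_dec end up aligned with the model's true feature directions; the evidence mentioned above is about the ways this hope currently falls short.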
The second big open problem is that the interactions, by default, don’t seem sparse. This is expected if there are multiple ways (e.g. SAE sizes) to decompose a layer, and adjacent layers aren’t decomposed correspondingly. In practice this means that one SAE feature seems to affect many SAE features in the next layers, more than we can easily understand. Plus, those interactions don’t seem to be crisp, which leads to the same issue as described above.
this is great, thanks for sharing
Thanks for the great writeup.
Typo: I think you meant to write distributed, not local, codes. A local code is the opposite of superposition.
Thanks! You’re right, totally mixed up local and dense / distributed. Decided to just leave out that terminology
Who is “we”? Is it:
only you and your team?
the entire Apollo Research org?
the majority of mechinterp researchers worldwide?
some other group/category of people?
Also, this definitely deserves to be made into a high-level post, if you end up finding the time/energy/interest in making one.
Thanks for the comment!
I think this is what most mech interp researchers more or less think. Though I definitely expect many researchers would disagree with individual points, nor does it fairly weigh all views and aspects (it’s very biased towards “people I talk to”). (Also this is in no way an Apollo / Apollo interp team statement, just my personal view.)