Regarding the more general question of “how much should interpretability make reference to the data distribution?”, here are a few thoughts:
Firstly, I think we should obviously make use of the data distribution to some extent (and much of my work has done so!). If you’re trying to reverse engineer a regular computer program, it’s extremely useful to have traces of that program running. So too with neural networks!
However, the fundamental thing I care about is understanding whether models will be safe off-distribution, so it's less clear how an understanding tied to a specific distribution – and especially to a narrow distribution – advances my core goals. Explanations which hold narrowly but break off distribution are one of my biggest worries for interpretability, and a big part of why I've taken the mechanistic approach rather than picking low-hanging fruit in correlational interpretability. I'm much more worried about explanations only holding on narrow distributions than I am about incomplete global explanations—this is probably a significant implicit motivator of my research taste. (Caveat: I'm reluctantly okay with certain aspects of understanding being built on the entire training distribution when we have a compelling theoretical argument for why this captures everything and will generalize.)
Let’s return to my example of protein binding affinities from my other comment and imagine two different descriptions of the situation:
The “global story” – We have a table of binding affinities. When one protein has a much higher binding affinity than the other, it outcompetes it.
The “on distribution story” – We have a table of proteins which “block” other proteins in practice.
The global story is a kind of “unbiased account of the mechanism” which requires us to think through more possibilities, but can predict weird out of distribution behavior. On the other hand, the “on distribution story” highlights the aspects of the mechanism which are important in practice, but might fail in weird situations.
But what do we want from the on-distribution analysis?
One easy answer is that we just want to use it to make mechanistic understanding easier. Neural networks are immensely complicated computer programs. It seems to me that even understanding small neural networks is probably comparable to something like "reverse engineering a compiled Linux kernel while knowing nothing about operating systems". It's very helpful to have examples of the program running to help bootstrap your analysis.
But I think there's something deeper which you're getting at, which I might articulate as distinguishing which aspects of a neural network's mechanistic behavior are "deliberate or useful" and which are "bugs or quirks". For example, in the framework paper we highlight some skip-trigrams which appear to be bugs.
Of course, distinguishing between "correct" skip-trigrams and "bug" skip-trigrams required our judgment based on understanding the domain. In an impartial account of the mechanism, they're all valid skip-trigrams the model implements. It's only with reference to the training distribution, or some other external distribution or task, that we can think of some as "correct" and others as "bugs".
By more explicitly analyzing on a distribution, one might automate this kind of differentiation. And possibly, one might just ignore these (especially to the extent that other heads or the bigrams can compensate in practice!). This could make a simpler “explanation” at the cost of not generalizing to other distributions.
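To gesture at what "analyzing on a distribution" could look like here – this is just a hypothetical sketch, not anything from the paper, and the function name, the token-id representation, and the `max_skip` window are all assumptions on my part – one could tally, for each skip-trigram the weights implement, how often its context actually occurs on the distribution and how often its prediction pans out:

```python
from collections import Counter

def skip_trigram_stats(corpus_tokens, skip_trigrams, max_skip=64):
    """For each candidate skip-trigram (a, b, c) the weights implement, count how
    often the context "a ... b" actually occurs on the distribution, and how often
    the predicted token c really follows. Skip-trigrams whose context rarely occurs,
    or whose prediction rarely pans out, are candidates for "bugs" or quirks."""
    stats = {t: Counter() for t in skip_trigrams}
    for i in range(len(corpus_tokens) - 1):
        b, nxt = corpus_tokens[i], corpus_tokens[i + 1]
        recent = set(corpus_tokens[max(0, i - max_skip):i])  # window where a could appear
        for (a, b_expected, c) in skip_trigrams:
            if b_expected == b and a in recent:
                stats[(a, b_expected, c)]["context_seen"] += 1
                if nxt == c:
                    stats[(a, b_expected, c)]["prediction_correct"] += 1
    return stats
```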
(In this particular case, I suspect there might actually be a more beautiful, non-distribution specific story to be told in terms of superposition. But that’s another topic.)
One interesting thing this suggests is that a “global story” should be able to be “bound” to a distribution to create an in-distribution account. For example, if one has a list of binding affinities for different chemicals, and knows that only a certain subset will be present at the same time, one can produce a summary of which will block each other.
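To make the "binding" operation concrete, here is a minimal sketch of the protein example – the data format, the function name, and the winner-take-all-with-margin rule are all simplifying assumptions for illustration, with "affinity" treated as an abstract score where higher means stronger:

```python
def bind_to_distribution(affinities, present, margin=2.0):
    """Specialize the "global story" (a full table of binding affinities) into an
    "on-distribution story" (which proteins block which in practice), given the
    subset of proteins that actually co-occur.

    affinities: dict mapping each binding site to {protein: affinity score}
    present:    set of proteins present at the same time on this distribution
    margin:     how much stronger a binder must be to count as "blocking" the rest
    """
    blocked_by = {}
    for site, binders in affinities.items():
        candidates = {p: score for p, score in binders.items() if p in present}
        if len(candidates) < 2:
            continue
        winner = max(candidates, key=candidates.get)
        for protein, score in candidates.items():
            # Only call it "blocking" when the winner's affinity is much higher.
            if protein != winner and candidates[winner] >= score + margin:
                blocked_by.setdefault(protein, set()).add(winner)
    return blocked_by
```

The same global table, bound to a different `present` set, yields a different summary – which is the sense in which the on-distribution story is derived from the global one rather than fundamental.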
While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially the narrow-distribution understanding that a lot of the work building on Causal Scrubbing seems to be focused on.
It seems to me that this kind of work is very vulnerable to producing fragile understandings of models which break on a wider distribution due to interpretability-illusion-type issues.
As one concrete example from my own experience, in the early days of Anthropic I looked into how language models perform arithmetic by looking at model behavior only on arithmetic expressions. Immediately, lots of interesting patterns popped out and some interesting partial stories began to emerge. However, as soon as I returned to the full training distribution, the story fell apart. All the components I thought did something were doing other things – often primarily doing other things – on the full distribution. Of course, this was a very casual investigation and not anywhere near as rigorous as the causal scrubbing work. But while I'm sure there were ways my understanding on distribution was incomplete, I'm 100x more worried about the fact that it was clearly misleading about the general situation. (My strong suspicion is that there is a very nice story here, but it's deeply intertwined with superposition and we can't understand it without addressing that.)
With that said, I’m very excited for people to be taking different approaches to these problems. My concerns could be misplaced! I definitely think that restricting to a narrow distribution allows one to make a lot of progress on that type of understanding.
I'm sympathetic to many of your concerns here.

It seems to me like the induction head mechanism as described in A Mathematical Framework is an example of just looking at what a part of a model does on a particular distribution, given that those heads also do some unspecified amount of non-induction behaviors with non-induction mechanisms, as e.g. discussed here: https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai . (Though there's a big quantitative difference—the distribution where induction happens is way bigger than e.g. the distribution where IOI happens.) Do you agree?
I moderately disagree with this? I think most induction heads are at least primarily induction heads (and this points strongly at the underlying attentional features and circuits), although there may be some superposition going on. (I also think that the evidence you’re providing is mostly orthogonal to this argument.)
I think if you're uncomfortable with induction heads, previous token heads (especially in larger models) are an even more crisp example of an attentional feature which appears, at least on casual inspection, to typically be monosemantically represented by attention heads. :)
As a meta point – I’ve left some thoughts below, but in general, I’d rather advance this dialogue by just writing future papers.
(1) The main evidence I have for thinking that induction heads (or previous token heads) are primarily implementing those attentional features is just informally looking at their behavior on lots of random dataset examples. This isn’t something I’ve done super rigorously, but I have a pretty strong sense that this is at least “the main thing”.
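For concreteness, the informal check is something in the spirit of the sketch below, assuming a PyTorch-style attention tensor. How you extract a head's attention pattern depends on your setup, so that part is left as a stub; the scoring itself is just "how much attention lands on induction targets", evaluated on e.g. a random token sequence repeated twice so that prefix matching is always possible:

```python
import torch

def prefix_matching_score(attn, tokens):
    """Fraction of a head's attention that lands on "induction targets":
    positions immediately following an earlier occurrence of the current token.

    attn:   [seq, seq] post-softmax attention pattern for a single head
    tokens: list of token ids for the same sequence
    """
    total, queries = 0.0, 0
    for q in range(1, len(tokens)):
        targets = [k for k in range(1, q + 1) if tokens[k - 1] == tokens[q]]
        if targets:
            total += attn[q, targets].sum().item()
            queries += 1
    return total / max(queries, 1)

# A random sequence repeated twice: every position in the second copy has a target.
tokens = torch.randint(0, 50_000, (128,)).tolist() * 2
# attn = ...  # this head's attention pattern on `tokens`, extracted from your model
# print(prefix_matching_score(attn, tokens))
```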
(2) I think there’s an important distinction between “imprecisely articulating a monosemantic feature” and “a neuron/attention head is polysemantic/doing multiple things”. For example, suppose I found a neuron and claimed it was a golden retriever detector. Later, it turns out that it’s a U-shaped floppy ear detector which fires for several species of dogs. In that situation, I would have misunderstood something – but the misunderstanding isn’t about the neuron doing multiple things, it’s about having had an incorrect theory of what the thing is.
It seems to me that your post is mostly refining the hypothesis of what the induction heads you are studying are – not showing that they do lots of unrelated things.
(3) I think our paper wasn't very clear about this, but I don't think your refinements of the induction heads were unexpected. (A) Although we thought that the specific induction head in the 2L model we studied only used a single QK composition term to implement a very simple induction pattern, we always thought that induction heads could do things like match [a][b][c]. Please see the image below, with a diagram from when we introduced induction heads that shows richer pattern matching, and then text which describes the k-composition for [a][b] as the "minimal way to create an induction head" and gives the QK-composition term to create an [a][b][c] matching case. (B) We also introduced induction heads as a sub-type of copying head, so them doing some general copying is also not very surprising – they're a copying head which is guided by an induction heuristic. (Just as one observes "neuron splitting" creating more and more specific features as one scales a model, I expect we get "attentional feature splitting" creating more and more precise attentional features.)
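To spell out the behavioral difference being described – purely as an illustrative heuristic over token ids, not a claim about the weights – the minimal [a][b] induction pattern predicts whatever followed the most recent earlier occurrence of the current token, while the richer [a][b][c]-style matching requires a longer prefix to match before copying:

```python
def induction_prediction(tokens, i, prefix_len=1):
    """Predict the token at position i+1 with the induction heuristic: find the most
    recent earlier occurrence of the prefix of length `prefix_len` ending at position
    i, and return the token that followed it.

    prefix_len=1 is the minimal [a][b] induction pattern;
    prefix_len=2 is the richer [a][b][c]-style matching."""
    if i + 1 < prefix_len:
        return None
    prefix = tokens[i - prefix_len + 1 : i + 1]
    for j in range(i - 1, prefix_len - 2, -1):  # scan back over earlier prefix end positions
        if tokens[j - prefix_len + 1 : j + 1] == prefix:
            return tokens[j + 1]                # the token that followed last time
    return None
```

For instance, `induction_prediction(list("abcxab"), 5, prefix_len=2)` returns `'c'`.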
(3.A) I think it's exciting that you've been clarifying induction heads! I only wanted to bring these clarifications up here because I keep hearing them cited as evidence against the framework paper and against the idea of monosemantic structures we can understand.
(3.B) I should clarify that I do think we misunderstood the induction heads we were studying in the 2L models in the framework paper. This was due to a bug in the computation of low-rank Frobenius norms in a library I wrote. This is on a list of corrections I’m planning to make to our past papers. However, I don’t think this reflects our general understanding of induction heads. The model was chosen to be (as we understood it at the time) the simplest case study of attention head composition we could find, not a representative example of induction heads.
(4) I think attention heads can exhibit superposition. The story is probably a bit different than that of normal neurons, but – drawing on intuition from toy models – I’m generally inclined to think: (a) sufficiently important attentional features will be monosemantic, given enough model capacity; (b) given a privileged basis, there’s a borderline regime where important features mostly get a dedicated neuron/attention head; (c) this gradually degrades into being highly polysemantic and us not being able to understand things. (See this progression as an example that gives me intuition here.)
It’s hard to distinguish “monosemantic” and “slightly polysemantic with a strong primary feature”. I think it’s perfectly possible that induction heads are in the slightly polysemantic regime.
(5) Without prejudice to the question of “how monosemantic are induction heads?”, I do think that “mostly monosemantic” is enough to get many benefits.
(5.A) Background: I presently think of most circuit research as “case studies where we can study circuits without having resolved superposition, to help us build footholds and skills for when we have”. Mostly monosemantic is a good proxy in this case.
(5.B) Mostly monosemantic features / attentional features allow us to study what features exist in a model. A good example of this is the SoLU paper – we believe many of the neurons have other features hiding in correlated small activations, but it also seems like it’s revealing the most important features to us.
(5.C) Being mostly monosemantic also means that, for circuit analysis, interference with other circuits will be mild. As such, the naive circuit analysis tells you a lot about the general story (weights for other features will be proportionally smaller). For contrast, compare this to a situation where one believes they’ve found a neuron (say a “divisible by seven” number detector, continuing my analogy above!) and it turns out that actually, that neuron mostly does other things on a broader distribution (and they even cause stronger activations!). Now, I need to be much more worried about my understanding…
(I also think that the evidence you’re providing is mostly orthogonal to this argument.)
Upon further consideration, I think you’re probably right that the causal scrubbing results I pointed at aren’t actually about the question we were talking about, my mistake.
but in general, I’d rather advance this dialogue by just writing future papers
Seems like probably the optimal strategy. Thanks again for your thoughts here.
(Context, I work at Redwood)

While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially the narrow-distribution understanding that a lot of the work building on Causal Scrubbing seems to be focused on.
Can I summarize your concerns as something like “I’m not sure that looking into the behavior of “real” models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?”
Or perhaps you think it’s slightly better, but not considerably?
If so, I mostly agree—it doesn't seem very clear that this is much better. I'm something like in favor of:
Picking a distribution
Training a model to perform well on that distribution
Interpreting the model (or parts of the model, etc.)
as a default interpretability workflow.
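Concretely, a minimal runnable version of that workflow might look like the sketch below. The toy task (learning an arbitrary token pairing), the tiny embed/unembed model, and all the names are just illustrative stand-ins, not a claim about what the interesting version of this looks like:

```python
import torch
import torch.nn.functional as F

# (1) Pick a distribution: uniform random tokens, labeled by a fixed pairing.
VOCAB, DIM = 16, 32
pairing = torch.randperm(VOCAB)

# (2) Train a model to perform well on that distribution.
embed = torch.nn.Embedding(VOCAB, DIM)
unembed = torch.nn.Linear(DIM, VOCAB, bias=False)
opt = torch.optim.Adam(list(embed.parameters()) + list(unembed.parameters()), lr=1e-2)
for step in range(2000):
    x = torch.randint(0, VOCAB, (64,))
    loss = F.cross_entropy(unembed(embed(x)), pairing[x])
    opt.zero_grad()
    loss.backward()
    opt.step()

# (3) Interpret the model: here the whole "circuit" is a single effective matrix,
# which (if training worked) should implement the permutation `pairing`.
with torch.no_grad():
    circuit = embed.weight @ unembed.weight.T   # [VOCAB, VOCAB] effective logits
    recovered = circuit.argmax(dim=1)
    print((recovered == pairing).float().mean().item())  # ~1.0 if the story holds
```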
For instance, it's not very clear to me that IOI is much more interesting than just training a model on some version of the IOI distribution and then interpreting that model. And I think a key problem with IOI is that the model doesn't really care very much about doing well on this exact task: after having skimmed through copious amounts[1] of OpenWebText, the IOI task as exactly formulated seems pretty non-central IMO.
There are various arguments for looking into narrow examples IMO, but the case is a bit more subtle. (For instance, it seems like we should ideally be able to answer questions like "why did this model have strange behavior on this narrow distribution?", where the "why" will probably have to make reference to how the model behaves on a broader distribution of interest.)
It's also possible we disagree about how useful it is to do interpretability on toy tasks. I'm not really sure if there's anything interesting and quick to say here.

[1] I've perhaps skimmed somewhere between 10,000 and 100,000 passages? (I haven't counted.)
Can I summarize your concerns as something like “I’m not sure that looking into the behavior of “real” models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?” Or perhaps you think it’s slightly better, but not considerably?
Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it.
Real language models seem to make extensive use of superposition. I expect there to be lots of circuits superimposed with the one you're studying, and I worry that studying it on a narrow distribution may give a misleading impression – as soon as you move to a broader distribution, overlapping features and circuits which you previously missed may activate, and your understanding may turn out to be not just incomplete but wrong.
On the other hand, for a model just trained on a toy task, I think your understanding is likely closer to the truth of what's going on in that model. If you're studying it over the whole training distribution, features either aren't in superposition (there's so much free capacity in most of these models that this seems possible!) or else they'll be part of the unexplained loss, in your language. So choosing to use a toy model is just a question of what that model teaches you about real models (for example, you've kind of side-stepped superposition, and it's also unclear to what extent the features and circuits in a toy model represent those in a larger model). But it seems much clearer what is true, and it also seems much clearer that these limitations exist.