It seems to me like the induction head mechanism as described in A Mathematical Framework is an example of just looking at what a part of a model does on a particular distribution, given that those heads also do some unspecified amount of non-induction behaviors with non-induction mechanisms, as eg discussed here https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai . (Though there’s a big quantitative difference—the distribution where induction happens is way bigger than eg the distribution where IOI happens.) Do you agree?
I moderately disagree with this? I think most induction heads are at least primarily induction heads (and this points strongly at the underlying attentional features and circuits), although there may be some superposition going on. (I also think that the evidence you’re providing is mostly orthogonal to this argument.)
I think if you’re uncomfortable with induction heads, previous token heads (especially in larger models) are an even crisper example of an attentional feature which appears, at least on casual inspection, to typically be monosemantically represented by attention heads. :)
As a meta point – I’ve left some thoughts below, but in general, I’d rather advance this dialogue by just writing future papers.
(1) The main evidence I have for thinking that induction heads (or previous token heads) are primarily implementing those attentional features is just informally looking at their behavior on lots of random dataset examples. This isn’t something I’ve done super rigorously, but I have a pretty strong sense that this is at least “the main thing”.
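The informal check described above can be made semi-quantitative with a "prefix-matching score": on a sequence containing repeated tokens, measure how much of a head's attention goes to the token that followed an earlier occurrence of the current token (the induction target). A minimal sketch, assuming you already have the head's attention pattern as a NumPy array; the function name and interface are illustrative, not from any particular library:

```python
import numpy as np

def prefix_matching_score(attn, tokens):
    """Fraction of attention mass a head places on induction targets:
    positions j such that tokens[j-1] == tokens[q], i.e. the token that
    followed an earlier occurrence of the current token,
    [a][b] ... [a] -> attend to [b].

    attn:   (seq, seq) attention pattern for one head (rows = query positions).
    tokens: (seq,) token ids.
    """
    seq = len(tokens)
    scores = []
    for q in range(1, seq):
        # Induction targets: each j whose preceding token matches the query token.
        targets = [j for j in range(1, q + 1) if tokens[j - 1] == tokens[q]]
        if targets:
            scores.append(attn[q, targets].sum())
    return float(np.mean(scores)) if scores else 0.0
```

A head that scores near 1 on lots of random repeated-token sequences is behaving like a textbook induction head on that distribution, which is the kind of "main thing" check I mean.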
(2) I think there’s an important distinction between “imprecisely articulating a monosemantic feature” and “a neuron/attention head is polysemantic/doing multiple things”. For example, suppose I found a neuron and claimed it was a golden retriever detector. Later, it turns out that it’s a U-shaped floppy ear detector which fires for several species of dogs. In that situation, I would have misunderstood something – but the misunderstanding isn’t about the neuron doing multiple things, it’s about having had an incorrect theory of what the thing is.
It seems to me that your post is mostly refining the hypothesis of what the induction heads you are studying are – not showing that they do lots of unrelated things.
(3) I think our paper wasn’t very clear about this, but I don’t think your refinements of the induction heads were unexpected. (A) Although we thought that the specific induction head in the 2L model we studied only used a single QK composition term to implement a very simple induction pattern, we always thought that induction heads could do things like match [a][b][c]. Please see the below image with a diagram from when we introduced induction heads that shows richer pattern matching, and then text which describes the k-composition for [a][b] as the “minimal way to create an induction head”, and gives the QK-composition term to create an [a][b][c] matching case. (B) We also introduced induction heads as a sub-type of copying head, so them doing some general copying is also not very surprising – they’re a copying head which is guided by an induction heuristic. (Just as one observes “neuron splitting” creating more and more specific features as one scales a model, I expect we get “attentional feature splitting” creating more and more precise attentional features.)
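To spell out the [a][b] vs. [a][b][c] distinction behaviorally: the minimal head matches only the current token against earlier occurrences, while a richer head (via an extra QK-composition term) also requires the preceding token(s) to match before attending. A hedged sketch of the target positions each variant would attend to (the name and interface are mine, purely illustrative):

```python
def induction_targets(tokens, q, n=1):
    """Positions j a match-of-length-n induction head would attend to from
    query position q: each j follows an earlier span whose last n tokens
    equal the n tokens ending at position q.

    n=1 is the minimal [a][b] ... [a] -> [b] pattern;
    n=2 is the richer [a][b][c] ... [a][b] -> [c] pattern.
    """
    return [
        j for j in range(n, q + 1)
        if all(tokens[j - 1 - k] == tokens[q - k] for k in range(n))
    ]
```

The longer match is strictly more selective: every n=2 target is also an n=1 target, so refining the matching rule narrows attention rather than doing something unrelated.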
(3.A) I think it’s exciting that you’ve been clarifying induction heads! I only wanted to bring these clarifications up here because I keep hearing it cited as evidence against the framework paper and against the idea of monosemantic structures we can understand.
(3.B) I should clarify that I do think we misunderstood the induction heads we were studying in the 2L models in the framework paper. This was due to a bug in the computation of low-rank Frobenius norms in a library I wrote. This is on a list of corrections I’m planning to make to our past papers. However, I don’t think this reflects our general understanding of induction heads. The model was chosen to be (as we understood it at the time) the simplest case study of attention head composition we could find, not a representative example of induction heads.
(4) I think attention heads can exhibit superposition. The story is probably a bit different than that of normal neurons, but – drawing on intuition from toy models – I’m generally inclined to think: (a) sufficiently important attentional features will be monosemantic, given enough model capacity; (b) given a privileged basis, there’s a borderline regime where important features mostly get a dedicated neuron/attention head; (c) this gradually degrades into being highly polysemantic and us not being able to understand things. (See this progression as an example that gives me intuition here.)
It’s hard to distinguish “monosemantic” and “slightly polysemantic with a strong primary feature”. I think it’s perfectly possible that induction heads are in the slightly polysemantic regime.
(5) Without prejudice to the question of “how monosemantic are induction heads?”, I do think that “mostly monosemantic” is enough to get many benefits.
(5.A) Background: I presently think of most circuit research as “case studies where we can study circuits without having resolved superposition, to help us build footholds and skills for when we have”. Mostly monosemantic is a good proxy in this case.
(5.B) Mostly monosemantic features / attentional features allow us to study what features exist in a model. A good example of this is the SoLU paper – we believe many of the neurons have other features hiding in correlated small activations, but it also seems like it’s revealing the most important features to us.
(5.C) Being mostly monosemantic also means that, for circuit analysis, interference with other circuits will be mild. As such, the naive circuit analysis tells you a lot about the general story (weights for other features will be proportionally smaller). For contrast, compare this to a situation where one believes they’ve found a neuron (say a “divisible by seven” number detector, continuing my analogy above!) and it turns out that actually, that neuron mostly does other things on a broader distribution (and they even cause stronger activations!). Now, I need to be much more worried about my understanding…
> (I also think that the evidence you’re providing is mostly orthogonal to this argument.)
Upon further consideration, I think you’re probably right that the causal scrubbing results I pointed at aren’t actually about the question we were talking about; my mistake.
> but in general, I’d rather advance this dialogue by just writing future papers
Seems like probably the optimal strategy. Thanks again for your thoughts here.
I’m sympathetic to many of your concerns here.