We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test them! We mainly wrote up this post because Alex and I both independently noticed this and weren't aware of it previously, so we wanted to make a reference post.
Happy to provide! I think I’m pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.
TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources (residual stream capacity and weights) between different learnable circuits, seems important.
The general topic I think we are getting at is something like “circuit economics”. The thing I’m trying to gesture at is that while circuits might deliver value in distinct ways (such as reducing loss on different inputs, or activating on distinct patterns), they share capacity in weights (see Polysemanticity and Capacity in Neural Networks) and, I guess, “bandwidth” (getting penalized for interfering signals in activations). There are a few reasons why this feels like economics to me: scarce resources, value chains (features composed of other features), and competition (if a circuit is already predicting something well with one heuristic, maybe there will be smaller gradient updates encouraging a different circuit, learning a different heuristic, to emerge).
So, to tie this back to your post and Alex’s comment (“which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability.”): I think that what interpretability has recently been doing in elucidating specific circuits is something like “micro-interpretability”, akin to microeconomics. However, this post seems to show a larger trend, i.e. “macro-interpretability”, which would possibly affect which of such circuits are possible or likely to exist in the final model.
I’ll elaborate briefly on the off chance this seems like it might be a useful analogy/framing to motivate further work.
Studying the Capacity/Loss Reduction distribution in Time: It seems like during transformer training there may be an effect not unlike inflation: circuits which delivered enough value to justify their capacity use early in training may fall below the capacity/loss-reduction cutoff later. Maybe various techniques which enable us to train more robust models work because they make these transitions easier.
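A minimal sketch of one way to start measuring the “in Time” version, assuming TransformerLens’s checkpoint loading for Pythia (checkpoint_index) and using the loss increase from zero-ablating a single, arbitrarily chosen head as a crude stand-in for a circuit’s loss reduction:

```python
from transformer_lens import HookedTransformer, utils

PROMPT = "The Eiffel Tower is located in the city of"
LAYER, HEAD = 3, 5  # hypothetical choice; substitute a head you care about

def zero_head(z, hook):
    # z: [batch, pos, head_index, d_head]; knock out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

for ckpt in [1, 10, 50, 100]:  # checkpoint indices, not training steps
    model = HookedTransformer.from_pretrained("pythia-70m", checkpoint_index=ckpt)
    tokens = model.to_tokens(PROMPT)
    clean_loss = model(tokens, return_type="loss").item()
    ablated_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
    ).item()
    print(f"checkpoint {ckpt}: ablating L{LAYER}H{HEAD} costs "
          f"{ablated_loss - clean_loss:.4f} nats")
```

A single prompt obviously isn’t a loss distribution, but the same loop over a small eval set would give the over-training trend I’m gesturing at.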
Studying the Capacity/Loss Reduction distribution in Layer: Moreover, it seems plausible that the distribution of “usefulness” across circuits in different layers of the network is far from uniform. Circuits later in the network have far more refined inputs, which makes them better at reducing loss. Residual stream norm growth seems like a “macro” effect showing that the model “knows” later layers are more important.
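A minimal sketch of the norm-growth measurement, assuming GPT-2 small via TransformerLens (the specific prompt doesn’t matter):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]            # [batch, pos, d_model]
    mean_norm = resid.norm(dim=-1).mean().item()  # average L2 norm over positions
    print(f"layer {layer:2d}: mean ||resid_post|| = {mean_norm:.1f}")
```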
Studying the Capacity/Loss Reduction distribution in Layer and Time: Combining the above, I’d predict that neural networks start out with valuable circuits in many layers, but then transition to maintaining circuits earlier in the network which are valuable to many downstream circuits, and circuits later in the network which make the best use of those earlier circuits.
More generally, “circuit economics” as a framing seems to suggest that there are different types of “goods” in the transformer economy: those which directly lead to better predictions, and those which are useful for making better predictions when integrated with other features. The success of Logit Lens seems to suggest that the latter category increases over the course of the layers. Maybe this is the only kind of good, in which case transformers would be “fundamentally interpretable” in some sense. All intermediate signals could be interpreted as final products. More likely, I think, is that later in training there are ways to reinforce the creation of more internal goods (in economics, goods which are used to make other goods are called capital goods). The value of such goods would be mediated via later circuits. This would also lead to the “deletion-by-magnitude theory” as a way of removing internal goods.
To bring this back to language already in the field, see Neel’s discussion here. A modular circuit is distinct from an end-to-end circuit in that it starts and ends in intermediate activations. Modular circuits may be composable. I propose that the outputs of such circuits are “capital goods”. If we think about the “circuit economy”, it then seems totally reasonable that multiple suppliers might generate equivalent capital goods and have a many-to-many relationship with multiple different circuits near the end voting on the logits.
This is very speculative “theory”, if you can call it that, but I guess I feel it would be “big if true”. I also make no claims about it being super original or actually that useful in practice, but it does feel intuition-generating. I think this is totally the kind of thing people might have worked on sooner, but it has likely been historically hard to measure the kinds of things that might be relevant. What your post shows is that, between the transformer circuits framework and TransformerLens, we are able to take a bunch of interesting measurements relatively quickly, which may provide more traction on this than was previously possible.
I only read TurnTrout’s summary of this plan, so this may be entirely unrelated, but the recent paper Generalizing Backpropagation for Gradient-Based Interpretability (video) seems like a good tool for this brand of interpretability work. You may want to reach out to the authors to prove out the viability of your paradigm together with their methods, or just use their methods directly.
More generally, “circuit economics” as a framing seems to suggest that there are different types of “goods” in the transformer economy: those which directly lead to better predictions, and those which are useful for making better predictions when integrated with other features. The success of Logit Lens seems to suggest that the latter category increases over the course of the layers. Maybe this is the only kind of good, in which case transformers would be “fundamentally interpretable” in some sense. All intermediate signals could be interpreted as final products.
Can you say more on this point? The latter kind of good (useful when integrated with other features) doesn’t necessarily imply that a direct unembed (Logit Lens) or a learned linear unembed (Tuned Lens, iirc) would be able to extract use from such goods. I suspect that I probably just missed your point, though.
Sure, I could have phrased myself better and I meant to say “former”, which didn’t help either!
Neither of these is a novel concept; existing investigations have already described features of this nature.
Good 1, aka consumer goods: useful for the unembed (and may or may not be useful for other modular circuits inside the network). That Logit Lens gets better over the course of the layers suggests the residual stream contains these kinds of features, and increasingly so as we move up the layers.
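A minimal Logit Lens sketch of what I mean by “gets better over the course of the layers”, assuming GPT-2 small via TransformerLens; the prompt/answer pair is just an illustrative choice:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt, answer = "The Eiffel Tower is in the city of", " Paris"
tokens = model.to_tokens(prompt)
answer_id = model.to_single_token(answer)

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]    # residual stream at the final position
    logits = model.ln_final(resid) @ model.W_U   # project straight to the vocab
    rank = (logits > logits[answer_id]).sum().item()
    print(f"layer {layer:2d}: rank of '{answer.strip()}' = {rank}")
```

If the residual stream is accumulating consumer goods, the rank of the correct answer should mostly fall as the layer index rises.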
Good 2, aka capital goods: useful primarily for other circuits. A good example is the kind of writing to subspaces done by the duplicate token heads in the IOI circuit: the signal that “John” appeared twice, written as markup on a token (a vector in a subspace of that token’s residual stream), doesn’t in itself tell you that Jane is the next token, but it is useful to another head which goes on to propose a name via another function.
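A rough sketch of how you could surface that markup yourself, assuming GPT-2 small via TransformerLens; rather than assuming specific head indices from the IOI paper, this just scans for heads whose attention from the second “ John” goes back to the first:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Jane went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
str_toks = model.to_str_tokens(prompt)

# Positions of the two " John" occurrences
first_john, second_john = [i for i, t in enumerate(str_toks) if t == " John"]

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]           # [head, query_pos, key_pos]
    attn_back = pattern[:, second_john, first_john]
    for head in torch.nonzero(attn_back > 0.5).flatten().tolist():
        print(f"L{layer}H{head}: {attn_back[head]:.2f} attention from the "
              f"second ' John' back to the first")
```

Heads found this way should behave like the IOI paper’s duplicate token heads, which is exactly the sort of capital-good signal I mean.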
Alternatively, in Neel’s modular arithmetic work, the circuit calculates wave terms like sin(wx) and cos(wx), which are only useful once you have the rest of the mechanism that takes the argmax over z of cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y−z)).
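That identity is easy to check numerically with no model at all (the modulus p and frequency k are just example values in the spirit of the grokking setup):

```python
import numpy as np

p = 113                      # modulus
k = 7                        # an arbitrary Fourier frequency with gcd(k, p) = 1
w = 2 * np.pi * k / p
x, y = 42, 97

z = np.arange(p)
score = np.cos(w * (x + y)) * np.cos(w * z) + np.sin(w * (x + y)) * np.sin(w * z)

np.testing.assert_allclose(score, np.cos(w * (x + y - z)))  # the identity holds
print(np.argmax(score), (x + y) % p)                        # both are 26
```

The sin/cos terms are worthless on their own; only the downstream argmax machinery converts them into a correct prediction, which is what makes them capital goods.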
I would have guessed that features arise first in the first category and only later in the second, since how else would you get gradients to things that aren’t useful yet? However, the existence of clear examples of “internal signals” seems fairly indisputable.
It seems plausible that there are lots of features that sit in both of these categories, of course, so if it’s useful you could define the two categories to be mutually exclusive and add a third for features that are both.
I realise that my saying “Maybe this is the only kind of good, in which case transformers would be “fundamentally interpretable” in some sense. All intermediate signals could be interpreted as final products.” was way too extreme. What I meant is that maybe category two is less common than we think.
To relate this to AVEC (though I don’t have a detailed understanding of how you are currently implementing it): if you find the vector (I assume a residual stream vector) itself has a high dot product with specific unembeddings, then that says you’re looking at something in category 1. However, if introducing it into the model earlier has a very different effect from introducing it directly before the unembedding, then that would suggest it’s also being used by other modular circuits in the model.
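A speculative sketch of both checks, assuming “the vector” is a residual-stream direction v of size d_model and that you’re working in TransformerLens; the prompt, injection scale, and layers are placeholders:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("I went to the shop and bought")
v = torch.randn(model.cfg.d_model)       # placeholder: substitute the AVEC vector

# Check 1: direct readout against the unembedding (category 1?)
direct_logits = v @ model.W_U            # dot product with every unembedding column
print(model.to_str_tokens(direct_logits.topk(5).indices))

# Check 2: inject v early vs. directly before the unembed (category 2?)
def add_v(resid, hook):
    resid[:, -1, :] += 5.0 * v           # arbitrary scale, at the final position
    return resid

def logits_with_injection(layer):
    return model.run_with_hooks(
        tokens, fwd_hooks=[(utils.get_act_name("resid_post", layer), add_v)]
    )[0, -1]

early = logits_with_injection(2)
late = logits_with_injection(model.cfg.n_layers - 1)
print((early - late).abs().max())        # a big gap suggests later circuits consume v
```

If check 1’s top tokens line up with the behavioural effect, that’s evidence for category 1; if the early injection behaves qualitatively differently from the late one, other modular circuits are probably consuming v.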
I think this kind of distinction is only one part of what I was trying to get at with circuit economics but hopefully that’s clearer! Sorry for the long explanation and initial confusion.