A little while ago I made a post speculating about some of the high-level structure of GPT-XL (side note: very satisfying to see info like this being dug out so clearly here). One of the weird things about GPT-XL is that it seems to focus a disproportionate amount of attention on the first token—except in a consistent chunk of the early layers (layers 1–8 for XL) and the very last layers.
Do you know if there is a similar pattern of a chunk of early layers in GPT-medium having much more evenly distributed attention than the middle layers of the network? If so, is the transition out of ‘early distributed attention’ associated with changes in the character of the SVD directions of the attention OV circuits / MLPs?
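Concretely, the kind of per-layer check I have in mind is something like the following (a rough sketch, assuming the TransformerLens API and GPT-2 medium; the prompt is arbitrary and this is not code from either post):

```python
# Sketch: measure how much attention each layer puts on the first token,
# assuming TransformerLens naming ("pattern" hook, cfg.n_layers).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")  # 24 layers
prompt = "The quick brown fox jumps over the lazy dog because it was late."
_, cache = model.run_with_cache(prompt)

for layer in range(model.cfg.n_layers):
    # Attention pattern for this layer: [batch, head, query_pos, key_pos]
    pattern = cache["pattern", layer]
    # Mean attention mass on the first token, averaged over heads and over
    # query positions > 0 (position 0 can only attend to itself).
    first_token_mass = pattern[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on first token = {first_token_mass:.3f}")
```

If the GPT-XL pattern carries over, this number should stay low in a chunk of early layers and then jump up in the middle of the network.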
I suspect that this ‘early distributed attention’ might be helping out with tasks like building multiply-tokenised words or figuring out syntax in GPT-XL. It would be quite nice if in GPT-medium the same early layers that have MLP SVD directions that seem associated with these kinds of tasks are also those that display more evenly distributed attention.
(Also, in terms of comparing the fraction of interpretable directions in MLPs per block across the different GPT sizes—I think it is interesting to consider the similarities when the x-axis is “fraction of layers through” instead of raw layer number. One potential (noisy) pattern here is that the models seem to have a rise and dip in the fraction of directions interpretable in MLPs in the first half of the network, followed by a second rise and dip in the latter half of the network.)
This seems like a super interesting result! Thanks for linking; I wasn't aware of it. I haven't specifically looked for this pattern in GPT2-medium, but I will now! Interestingly, we have also been thinking along similar lines: a three-phase sequence for processing in residual networks like transformers, where the first few layers do some kind of 'large-scale' reshaping of the geometry of the data, the later layers mostly make smaller refinements that don't change the basic geometry of the representation much, and the final layer does one massive map to output space. This becomes quite obvious if you look at the cosine similarities of the residual stream between blocks. I hadn't made the link with the attention patterns potentially being more widely distributed at earlier layers, though.
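For reference, the cosine-similarity check I mean looks roughly like this (a sketch assuming the TransformerLens API; our actual code may differ):

```python
# Sketch: cosine similarity of the residual stream between consecutive blocks.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog.")

for layer in range(model.cfg.n_layers - 1):
    # Residual stream after block `layer` and after block `layer + 1`: [batch, pos, d_model]
    resid_a = cache["resid_post", layer]
    resid_b = cache["resid_post", layer + 1]
    # Cosine similarity per position, then averaged over positions.
    sim = torch.nn.functional.cosine_similarity(resid_a, resid_b, dim=-1).mean().item()
    print(f"blocks {layer} -> {layer + 1}: mean cosine similarity = {sim:.3f}")
```

In the middle of the network these similarities sit close to 1, which is what makes the 'small refinements' phase stand out against the early and final layers.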
I suspect that this ‘early distributed attention’ might be helping out with tasks like building multiply-tokenised words or figuring out syntax in GPT-XL. It would be quite nice if in GPT-medium the same early layers that have MLP SVD directions that seem associated with these kinds of tasks are also those that display more evenly distributed attention.
This would be easy to look at, and we might potentially see something in the OV circuits. A general downside of this method is that I have never had any success applying it to the QK circuits, and I think that's because the attention is often performing syntactic rather than semantic operations, so projecting to embedding space is meaningless. I agree with the qualitative assessment that the early attention blocks are probably doing a lot of basic syntax/detokenization tasks like this, although I don't have a good sense of whether the MLPs are also doing this or some other kind of simple semantic processing.
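Concretely, for a single head this kind of OV check might look roughly like the following (a sketch only, using TransformerLens weight conventions; the layer/head choice is arbitrary and the colab's actual procedure may differ):

```python
# Sketch: project SVD directions of one head's OV circuit into token space,
# assuming TransformerLens weight shapes (W_V: [layer, head, d_model, d_head],
# W_O: [layer, head, d_head, d_model], W_U: [d_model, d_vocab]).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
layer, head, k_dirs, top_k = 2, 0, 5, 10  # arbitrary head and direction counts

# Full OV circuit for this head: [d_model, d_model], applied as x @ W_OV.
W_OV = model.W_V[layer, head] @ model.W_O[layer, head]
U, S, Vh = torch.linalg.svd(W_OV)

for i in range(k_dirs):
    # Output-side singular direction (what this head writes into the residual
    # stream), projected through the unembedding to get logits over tokens.
    direction_logits = Vh[i] @ model.W_U
    top = direction_logits.topk(top_k).indices.tolist()
    top_tokens = [model.tokenizer.decode([t]) for t in top]
    print(f"direction {i} (singular value {S[i].item():.2f}): {top_tokens}")
```

The interesting question for the early layers would then be whether these top tokens look like word fragments / syntax glue rather than coherent semantic clusters.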
(Also, in terms of comparing the fraction of interpretable directions in MLPs per block across the different GPT sizes—I think it is interesting to consider the similarities when the x-axis is “fraction of layers through” instead of raw layer number. One potential (noisy) pattern here is that the models seem to have a rise and dip in the fraction of directions interpretable in MLPs in the first half of the network, followed by a second rise and dip in the latter half of the network.)
I am pretty sure I made plots for this (there are definitely comparable plots in the colab already, but in terms of absolute layer number instead of fraction, so you will have to 'imagine' stretching them out). I agree there is an interesting-seeming noisy pattern here. My feeling is that the early dip is probably noise, and I am not sure about the later one. That said, a lot of the time when I have qualitatively looked at the final block, the directions do suddenly become weird or meaningless there.
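For the 'stretching', re-plotting against fraction of depth is simple once the per-layer fractions exist; a sketch with placeholder numbers rather than the real data:

```python
# Sketch: compare per-layer interpretable-direction fractions across model sizes
# on a shared "fraction of depth" x-axis. The values below are placeholders,
# not the actual results from the colab.
import matplotlib.pyplot as plt

per_model_fractions = {
    "gpt2-medium (24 layers)": [0.4] * 24,  # placeholder values
    "gpt2-large (36 layers)": [0.4] * 36,   # placeholder values
}

for name, fracs in per_model_fractions.items():
    depth = [i / (len(fracs) - 1) for i in range(len(fracs))]
    plt.plot(depth, fracs, label=name)

plt.xlabel("fraction of layers through the network")
plt.ylabel("fraction of interpretable MLP directions")
plt.legend()
plt.show()
```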
This is great!