Understand IOI in GPT-Neo: it’s a same size model but does IOI via composition of MLPs
GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.
Honestly I expect that training without dropout makes it notably better. Dropout is fucked! Interesting that you say logit lens fails and later layers don’t matter—can you say more about that?
Arthur mentions something in the walkthrough about how GPT-Neo does seem to have some backup heads, which is wild—I agree that intuitively backup heads should come from dropout.
Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that—some ideas to consider:
the backup heads could have other main functions but incidentally are useful for the specific task we’re looking at, so they end up taking the place of the main heads
thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren’t interpretable in big models due to superposition
Re: GPT-Neo being weird, one of the Colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. it outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with logit lens) but still bad. Basically, the residual stream is in a totally wacky basis until the last layer’s computations, unlike GPT-2, which shows more stability (the whole reason logit lens works).
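For anyone who hasn't seen logit lens spelled out, here is a minimal NumPy sketch of the idea: take the residual stream after each layer, apply the final layer norm, multiply by the unembedding, and read off the top token. Everything here is a toy stand-in (the function names, the identity unembedding and the hand-built residuals are assumptions for illustration, not code from the Colab).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer norm over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def logit_lens_top_tokens(resid_per_layer, gamma, beta, W_U):
    """Decode every layer's residual stream as if it were the final one.

    resid_per_layer: [n_layers, seq_len, d_model]
    W_U:             [d_model, vocab_size]
    returns:         [n_layers, seq_len] array of top token ids
    """
    normed = layer_norm(resid_per_layer, gamma, beta)
    logits = normed @ W_U  # [n_layers, seq_len, vocab_size]
    return logits.argmax(axis=-1)

# Toy setup (hypothetical): d_model == vocab_size and W_U is the identity,
# so a residual stream pointing along axis k decodes to token k.
d_model = 8
W_U = np.eye(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

resid = np.zeros((3, 2, d_model))
resid[0, :, 1] = 5.0  # layer 0 "believes" token 1
resid[1, :, 4] = 5.0  # layer 1 "believes" token 4
resid[2, :, 6] = 5.0  # layer 2 "believes" token 6

top = logit_lens_top_tokens(resid, gamma, beta, W_U)
print(top)
```

On a real model, logit lens "working" means the intermediate rows already look like plausible next-token guesses; the claim above is that for GPT-Neo they mostly don't until the last layer.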
One weird thing I noticed with GPT-Neo 125M’s embedding matrix is that the input static embeddings are super concentrated in vector space, avg. pairwise cosine similarity is 0.960 compared to GPT-2 small’s 0.225.
On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven’t looked into this more myself so I don’t know how it compares to GPT-2. Just seems to be an overall profoundly strange model.
Just dug into it more: the GPT-Neo embed just has a large constant offset. Average norm is 11.4, norm of the mean is 11. Avg cosine sim is 0.93 before subtracting the mean; after subtracting it, it’s 0.0024 (avg absolute value of cosine sim is 0.1831).
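As a sanity check on those numbers: if every embedding is roughly a shared offset of norm 11 plus a small near-orthogonal residual, the average cosine similarity should be close to 11² / 11.4² ≈ 0.93, which matches. Below is a small sketch with synthetic embeddings (not the real GPT-Neo weights; the sizes, the seed and the helper name `mean_pairwise_cosine` are assumptions) showing the same effect and how subtracting the mean removes it.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Average cosine similarity over all pairs of distinct rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = ~np.eye(len(X), dtype=bool)  # drop each row's similarity with itself
    return sims[off_diagonal].mean()

rng = np.random.default_rng(0)
n_tokens, d_model = 500, 256  # toy sizes, far smaller than the real embedding matrix

# Per-token residuals with norm around 3, and one shared offset with norm exactly 11,
# chosen so the average row norm lands near the 11.4 quoted above.
residual = rng.normal(scale=3.0 / np.sqrt(d_model), size=(n_tokens, d_model))
offset = np.full(d_model, 11.0 / np.sqrt(d_model))
W_E = offset + residual

before = mean_pairwise_cosine(W_E)
after = mean_pairwise_cosine(W_E - W_E.mean(axis=0))

print(f"avg norm:            {np.linalg.norm(W_E, axis=1).mean():.2f}")
print(f"norm of mean:        {np.linalg.norm(W_E.mean(axis=0)):.2f}")
print(f"avg cos sim before:  {before:.3f}")
print(f"avg cos sim after:   {after:.4f}")
```

The synthetic residuals are isotropic, so the "after" number here only shows that the offset explains the high similarity; it says nothing about the structure left in the real centered embeddings.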
Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn’t crazy tbh).
The Colab claims that the logit lens doesn’t work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2 the MLP0 is basically part of the embed, so it doesn’t seem crazy for the inverse to be true (esp if you do the dumb thing of making your embedding + unembedding matrix the same)
I’m pretty sure! I don’t think I messed up anywhere in my code (just nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. $\mathrm{softmax}(W_E W_U) = \mathrm{softmax}(W_E W_E^T)$) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude since in the dot product $x \cdot y = \|x\| \|y\| \cos(\theta)$ the cosine similarity is a useless term.
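A toy illustration of that argmax behaviour, assuming tied embeddings so that the lens is just $W_E W_E^T$. The data is synthetic, and one token is deliberately given a 1.5x larger norm (token index 7, chosen arbitrarily) so there is an unambiguous "largest magnitude" vector; none of this uses the real GPT-2 or GPT-Neo weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 200, 256  # toy sizes

def top_token_per_row(W_E):
    """Argmax of W_E @ W_E.T per row; softmax is monotonic, so it has the same argmax."""
    return (W_E @ W_E.T).argmax(axis=1)

# "Balanced" embeddings: similar norms (around 3) and no shared direction.
balanced = rng.normal(scale=3.0 / np.sqrt(d_model), size=(n_tokens, d_model))

# "Shifted" embeddings: the same vectors plus a shared offset of norm 11,
# with one token scaled up on purpose so it has the largest norm.
offset = np.full(d_model, 11.0 / np.sqrt(d_model))
shifted = balanced + offset
big = 7  # hypothetical index of the large-norm token
shifted[big] *= 1.5

self_hit_rate = (top_token_per_row(balanced) == np.arange(n_tokens)).mean()
top_shifted = top_token_per_row(shifted)

print("balanced: fraction of tokens that decode to themselves:", self_hit_rate)
print("shifted:  distinct top tokens across all rows:", np.unique(top_shifted))
```

With the offset present, every pairwise cosine is close to 1, so the ranking within a row is decided almost entirely by $\|y\|$, and the large-norm token wins every row.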
What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
See my other comment—it turns out to be the boring fact that there’s a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)
I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn’t used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn’t. And generally the kind of tasks that I’d expect to depend on tokens depend substantially on MLP0.
Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo’s later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.
Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?
Haven’t checked lol
Thanks for the comment!
I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In depth, when GPT-Neo is fed a sequence of tokens $t_1 t_2 \ldots t_{10} t_{11} t_{12} \ldots t_{20}$ where $t_1, \ldots, t_{10}$ are uniformly random and $t_i = t_{i-10}$ for $i \geq 11$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attend from $t_i$ to $t_{i-9}$). Three of these heads (6.0, 6.6, 6.11) when ablated decrease loss, and one of these heads increases loss on ablation (6.1). Interestingly, when 6.1 is ablated, the additional ablation of 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!).
My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they “compensate” when 6.1 is ablated.
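For concreteness, here is a minimal sketch of the repeated-random-token setup and of one way to score a head for the induction pattern (average attention from $t_i$ to $t_{i-9}$ over the second half). The helper names, the vocabulary size of 50257 and the hand-built attention patterns are assumptions for illustration; this is not the code behind the ablation results above, and it does not run the model.

```python
import numpy as np

def repeated_random_tokens(half_len=10, vocab_size=50257, seed=0):
    """Uniformly random tokens t_1..t_10 followed by an exact repeat t_11..t_20."""
    rng = np.random.default_rng(seed)
    first_half = rng.integers(0, vocab_size, size=half_len)
    return np.concatenate([first_half, first_half])

def induction_score(pattern, half_len=10):
    """Mean attention from each second-half position to the token right after
    the earlier copy of the current token.

    pattern: [seq_len, seq_len], rows are query positions, columns are key positions.
    With 0-indexed positions, query p looks at key p - (half_len - 1).
    """
    queries = np.arange(half_len, pattern.shape[0])
    keys = queries - (half_len - 1)
    return pattern[queries, keys].mean()

tokens = repeated_random_tokens()

# A hand-built "perfect induction head": second-half queries put all weight on p - 9.
ideal = np.eye(20)
ideal[10:] = 0.0
ideal[np.arange(10, 20), np.arange(10, 20) - 9] = 1.0

# A baseline head that attends uniformly over all previous positions (causal mask).
uniform = np.tril(np.ones((20, 20)))
uniform /= uniform.sum(axis=1, keepdims=True)

print("tokens repeat:", bool((tokens[10:] == tokens[:10]).all()))
print("ideal head score:  ", induction_score(ideal))
print("uniform head score:", round(float(induction_score(uniform)), 3))
```

On a real model you would pass in each head's attention pattern from a forward pass on `tokens` and pick out the heads whose score is close to 1.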