Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn’t crazy tbh).
The Colab claims that the logit lens doesn’t work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2, MLP0 is basically part of the embed, so it doesn’t seem crazy for the reverse to be true at the other end, i.e. the final block being basically part of the unembed (esp. if you do the dumb thing of making your embedding and unembedding matrices the same).
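For concreteness, here’s a minimal sketch of that comparison (not the Colab’s actual code), assuming the HuggingFace `transformers` checkpoint `EleutherAI/gpt-neo-125M`; the prompt and the choice of mid-layer are arbitrary, and exact call signatures may differ slightly across `transformers` versions:

```python
# Sketch: logit lens on GPT-Neo, with and without routing the intermediate
# residual stream through the final transformer block first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125M"  # arbitrary small GPT-Neo checkpoint
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)

text = "The Eiffel Tower is located in the city of"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
    # hidden_states[k] is the residual stream after block k (index 0 = embeddings).
    mid = out.hidden_states[len(out.hidden_states) // 2]

    # Plain logit lens: final layer norm + unembed applied directly.
    plain = model.lm_head(model.transformer.ln_f(mid))

    # "Include the final block": run the intermediate residual stream through
    # the last transformer block before the layer norm + unembed.
    through_last = model.transformer.h[-1](mid)[0]
    with_block = model.lm_head(model.transformer.ln_f(through_last))

for label, logits in [("plain logit lens", plain), ("through final block", with_block)]:
    top = logits[0, -1].argmax().item()
    print(label, "->", repr(tok.decode(top)))
```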
I’m pretty sure! I don’t think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying the logit lens to the embedding matrix (i.e. $\mathrm{softmax}(W_E W_U) = \mathrm{softmax}(W_E W_E^T)$, since the embedding and unembedding are tied) gives us a near-perfect autoencoder (the top output is the input token itself), but for GPT-Neo it always gives us the token whose embedding has the largest magnitude, since in the dot product $x \cdot y = \|x\|\,\|y\|\cos(\theta)$ the cosine similarity is a useless term.
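A rough sketch of that check, assuming the standard HuggingFace checkpoints (`gpt2` and `EleutherAI/gpt-neo-125M`); the sample size and the helper name `self_recovery_rate` are mine:

```python
# Sketch: how often does argmax_j (e_i . e_j) recover token i itself?
# (softmax is monotone, so the argmax of the logits equals the argmax of softmax(W_E W_E^T).)
# A random subset of tokens keeps the matrix product small.
import torch
from transformers import AutoModelForCausalLM

def self_recovery_rate(name, n_sample=2000, seed=0):
    model = AutoModelForCausalLM.from_pretrained(name)
    W_E = model.get_input_embeddings().weight.detach()  # [vocab, d_model]
    torch.manual_seed(seed)
    idx = torch.randperm(W_E.shape[0])[:n_sample]
    scores = W_E[idx] @ W_E.T                 # dot product with every embedding
    recovered = scores.argmax(dim=-1) == idx  # does the top match the input token?
    return recovered.float().mean().item()

for name in ["gpt2", "EleutherAI/gpt-neo-125M"]:
    print(name, self_recovery_rate(name))
```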
What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
See my other comment—it turns out to be the boring fact that there’s a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)
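Something like the following is one way to see the effect, assuming the `EleutherAI/gpt-neo-125M` checkpoint; sampling random token pairs is just to keep it cheap:

```python
# Sketch: average pairwise cosine similarity between GPT-Neo token embeddings,
# before and after subtracting the mean embedding vector. A large constant offset
# shared by all embeddings should push the raw number near 1 and the centered
# number near 0.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
W_E = model.get_input_embeddings().weight.detach()  # [vocab, d_model]

def mean_pairwise_cos(emb, n_pairs=20000, seed=0):
    torch.manual_seed(seed)
    i = torch.randint(0, emb.shape[0], (n_pairs,))
    j = torch.randint(0, emb.shape[0], (n_pairs,))
    return F.cosine_similarity(emb[i], emb[j], dim=-1).mean().item()

print("raw:     ", mean_pairwise_cos(W_E))
print("centered:", mean_pairwise_cos(W_E - W_E.mean(dim=0, keepdim=True)))
```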
I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn’t used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn’t. And generally, the kinds of tasks that I’d expect to depend on token identity depend substantially on MLP0.
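One could reproduce that ablation comparison along these lines; a sketch using zero-ablation via PyTorch forward hooks on the HuggingFace `gpt2` model, with the prompt and the comparison layer (MLP5 here) chosen arbitrarily by me (and zero-ablation is cruder than, say, mean-ablation):

```python
# Sketch: zero-ablate a single MLP in GPT-2 via a forward hook and compare
# the language-modelling loss against the clean model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
ids = tok("The quick brown fox jumps over the lazy dog because it was",
          return_tensors="pt").input_ids

def loss_with_mlp_ablated(layer=None):
    handle = None
    if layer is not None:
        # Returning zeros from the hook replaces the MLP's output, removing its
        # contribution to the residual stream.
        handle = model.transformer.h[layer].mlp.register_forward_hook(
            lambda module, inputs, output: torch.zeros_like(output)
        )
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()
    if handle is not None:
        handle.remove()
    return loss

print("clean:       ", loss_with_mlp_ablated())
print("MLP0 ablated:", loss_with_mlp_ablated(0))
print("MLP5 ablated:", loss_with_mlp_ablated(5))
```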
Cool that you figured that out; that easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since it means GPT-Neo’s later layers have to do their computation taking that offset into account, which doesn’t seem like an efficient design at all. I will def poke around more.
Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?
Haven’t checked lol