Cool results!
A few questions:
The total logit diff between “a” and “an” contributed by layer 31 seems to be ~1.5 based on your logit lens figure, but your neuron only contributes ~0.4 -- do you have a sense of how exactly the remaining 1.1 is divided?
What’s going on with the negative logit lens layers?
Is there a reason you focus on output congruence as opposed to cosine similarity (which is just normalized congruence)? Intuitively, it seems like the scale of the output vector of an MLP neuron can be relatively arbitrary (only constrained to some extent by weight decay), since you can always scale the input instead. Do you expect the results to be different if you used that metric instead?
(I’m guessing not because of the activation patching experiment)
The total logit diff between “a” and “an” contributed by layer 31 seems to be ~1.5 based on your logit lens figure, but your neuron only contributes ~0.4 -- do you have a sense of how exactly the remaining 1.1 is divided?
Sorry we didn’t explain what the scales are on the figures! I’ve added a clarification in the post. The first graph is the absolute logit difference between " a" and " an". For each of the activation patching figures, the metric shown is the relative logit recovery:

$$\text{Relative Logit Recovery} = \frac{\text{PatchedLogitDiff} - \text{CorruptedLogitDiff}}{\text{CleanLogitDiff} - \text{CorruptedLogitDiff}}$$
So 1 means the patch recovered the same logit diff as the clean prompt, 0 means the patch didn’t change the corrupted prompt’s logit diff, <0 means the patch made the logit diff worse than the corrupted prompt, etc.
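For concreteness, here is a minimal sketch of this metric (the function name is mine, not from the post):

```python
# Minimal sketch of the relative-logit-recovery metric defined above.
def relative_logit_recovery(patched: float, corrupted: float, clean: float) -> float:
    """1.0 means the patch fully restores the clean logit diff; 0.0 means no change
    from the corrupted run; negative values mean the patch made things worse."""
    return (patched - corrupted) / (clean - corrupted)
```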
We can see from the MLP layer patching figure that patching MLP 31 recovers 49% of the performance of the clean prompts (you can see the exact numbers on the interactive figures in the linkpost). And from the neuron patching figure we see that patching just Neuron 892 recovers 50% of the clean prompt performance, so actually the rest of the layer is entirely unhelpful.
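For readers who want to reproduce this kind of experiment, here is a rough sketch of single-neuron activation patching using TransformerLens-style hooks; the hook name and helper methods are assumptions on my part, and the prompts are placeholders rather than the ones used in the post:

```python
# Rough sketch of single-neuron activation patching (assumes TransformerLens;
# prompts are placeholders, not the ones used in the post).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER, NEURON = 31, 892  # the "an" neuron discussed in the post

# Placeholder prompts: they should tokenize to the same length, with the clean
# prompt expecting " an" and the corrupted prompt expecting " a".
clean_tokens = model.to_tokens("<clean prompt where ' an' is the correct next token>")
corrupt_tokens = model.to_tokens("<corrupted prompt where ' a' is the correct next token>")

a_id, an_id = model.to_single_token(" a"), model.to_single_token(" an")

def logit_diff(logits):
    # " an" minus " a" logit at the final position
    return (logits[0, -1, an_id] - logits[0, -1, a_id]).item()

# Cache the clean run, then overwrite one neuron's activation in the corrupted run.
_, clean_cache = model.run_with_cache(clean_tokens)
hook_name = f"blocks.{LAYER}.mlp.hook_post"

def patch_neuron(value, hook):
    # value: [batch, pos, d_mlp]; copy in the clean activation of a single neuron
    value[:, :, NEURON] = clean_cache[hook_name][:, :, NEURON]
    return value

patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_neuron)])
print(logit_diff(patched_logits))
```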
The next question might be: “Why does patching MLP 31 only recover 49% of the performance when the logit lens makes it look like it’s doing all the work?” I’m not sure what the answer to this is, but I also don’t think it’s particularly surprising. It may be that, when running on the corrupted prompt, MLP 33 adds a bunch to the " a" logit via the residual stream, and patching MLP 31 doesn’t change that very much.
What’s going on with the negative logit lens layers?
I think this just means that for the first 30 layers the model moves towards " a" being a better guess than " an". I expect a lot of the computation in these layers is working out that an indefinite article is required, of which " a" is more likely a priori. Only at layer 31 does it realize that " an" is actually more appropriate in this situation than " a".
Is there a reason you focus on output congruence as opposed to cosine similarity (which is just normalized congruence)? Intuitively, it seems like the scale of the output vector of an MLP neuron can be relatively arbitrary (only constrained to some extent by weight decay), since you can always scale the input instead. Do you expect the results to be different if you used that metric instead?
It seems to me like using cosine similarity could give different and misleading results. Imagine if the " an" direction pointed in exactly the same direction (cosine similarity = 1) as the output vectors of two neurons. If one of the two neurons has a magnitude 100× bigger than the other, then it will have 100× more impact on the " an" logit.
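A toy numerical version of this point (the dimensions and vectors are made up for illustration):

```python
# Two neuron output vectors with identical direction (cosine similarity 1 with the
# " an" unembedding direction) but norms differing by 100x: their dot products with
# that direction, and hence their effect on the " an" logit, differ by 100x.
import numpy as np

rng = np.random.default_rng(0)
an_unembed = rng.standard_normal(1280)   # stand-in for the " an" unembedding vector
small_neuron = 0.01 * an_unembed         # same direction, small norm
big_neuron = 1.0 * an_unembed            # same direction, 100x the norm

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(small_neuron, an_unembed), cosine(big_neuron, an_unembed))  # both 1.0
print(small_neuron @ an_unembed, big_neuron @ an_unembed)                # differ by 100x
```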
I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
Thanks for clarifying the scales!
I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
I might be misremembering the GPT2 architecture, but I thought the output of the MLP layer was something like $W_{out} \cdot \mathrm{ReLU}(W_{in} \cdot \mathrm{LN}(x_{\mathrm{residual}}))$? So you can just scale $W_{in}$ up when you scale $W_{out}$ down. (Assuming I’m remembering the architecture correctly,) if you’re concerned about the scale of the output, then I think it makes sense to look at the scale of $W_{in}$ as well.
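A quick numerical check of this rescaling argument (with a ReLU MLP and biases omitted, as in the comment above):

```python
# With ReLU, scaling W_in by c > 0 and W_out by 1/c leaves the MLP output unchanged,
# because ReLU(c * z) = c * ReLU(z) for c > 0. Shapes are arbitrary and biases omitted.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32
W_in = rng.standard_normal((d_mlp, d_model))
W_out = rng.standard_normal((d_model, d_mlp))
x = rng.standard_normal(d_model)
relu = lambda z: np.maximum(z, 0)

c = 10.0
original = W_out @ relu(W_in @ x)
rescaled = (W_out / c) @ relu((c * W_in) @ x)
print(np.allclose(original, rescaled))  # True
```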
We took the dot product over cosine similarity because the dot product is the neuron’s direct effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
I think your point about looking at the scale of $W_{in}$ if we are concerned about the scale of $W_{out}$ is fair. We didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence multiplied by the average of the squared input weights) could give a better representation of a neuron’s relevance for a token.
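As a sketch of what that hypothetical score might look like (the names and shapes are illustrative, not the post's code):

```python
# Hypothetical "input-scaled congruence": the usual congruence (dot product of a
# neuron's output weights with a token's unembedding vector), weighted by the mean
# squared input weight of that neuron.
import numpy as np

def input_scaled_congruence(w_out_neuron, w_in_neuron, token_unembed):
    congruence = w_out_neuron @ token_unembed   # neuron's direct effect on that token's logit
    input_scale = np.mean(w_in_neuron ** 2)     # crude proxy for how strongly the neuron is driven
    return congruence * input_scale
```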
I do agree that looking at $W_{out}$ alone seems a bit misguided (unless we’re normalizing by looking at cosine similarity instead of dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:
At first blush, the thing you said is exactly right; scaling $W_{in}$ up and scaling $W_{out}$ down will leave the implemented function unchanged.
However, this’ll affect the L2 regularization penalty. All else equal, we’d expect to see $\|W_{in}\| = \|W_{out}\|$, since that minimizes the regularization penalty.
However, this is all complicated by the fact that you can alternatively scale the LayerNorm’s gain parameter, which (I think) isn’t regularized.
Lastly, I believe GPT2 uses GELU, not ReLU? This is significant, since it no longer allows you to rescale $W_{in}$ and $W_{out}$ without changing the implemented function.
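A quick check of this last point, using the tanh approximation of GELU that GPT2 uses:

```python
# Unlike ReLU, GELU is not positively homogeneous: GELU(c * x) != c * GELU(x) for c > 0,
# so scaling W_in up and W_out down does change the implemented function.
import numpy as np

def gelu(z):
    # tanh approximation of GELU (the "gelu_new" variant used in GPT-2)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

x = np.array([-1.0, -0.1, 0.1, 1.0])
c = 10.0
print(np.allclose(gelu(c * x), c * gelu(x)))  # False
```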