Cool results!
A few questions:
The total logit diff between “a” and “an” contributed by layer 31 seems to be ~1.5 based on your logit lens figure, but your neuron only contributes ~0.4 -- do you have a sense of how exactly the remaining 1.1 is divided?
What’s going on with the negative logit lens layers?
Is there a reason you focus on output congruence as opposed to cosine similarity (which is just normalized congruence)? Intuitively, it seems like the scale of the output vector of an MLP neuron can be relatively arbitrary (only constrained to some extent by weight decay), since you can always scale the input instead. Do you expect the results to be different if you used that metric instead?
(I’m guessing not because of the activation patching experiment)
The total logit diff between “a” and “an” contributed by layer 31 seems to be ~1.5 based on your logit lens figure, but your neuron only contributes ~0.4 -- do you have a sense of how exactly the remaining 1.1 is divided?
Sorry we didn’t explain what the scales are on the figures! I’ve added a clarification in the post. The first graph is the absolute logit difference between " a" and " an". For each of the activation patching figures, the metric shown is the relative logit recovery:

$$\text{Relative Logit Recovery} = \frac{\text{PatchedLogitDiff} - \text{CorruptedLogitDiff}}{\text{CleanLogitDiff} - \text{CorruptedLogitDiff}}$$
So 1 means the patch recovered the same logit diff as the clean prompt, 0 means the patch didn’t change the corrupted prompt’s logit diff, <0 means the patch made the logit diff worse than the corrupted prompt, etc.
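For concreteness, here is a minimal sketch of this metric (the function name is mine, not from the post):

```python
# Minimal sketch of the relative-logit-recovery metric defined above.
def relative_logit_recovery(patched: float, corrupted: float, clean: float) -> float:
    """1.0 means the patch fully restores the clean logit diff; 0.0 means no change
    from the corrupted run; negative values mean the patch made things worse."""
    return (patched - corrupted) / (clean - corrupted)
```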
We can see from the MLP layer patching figure that patching MLP 31 recovers 49% of the performance of the clean prompts (you can see the exact numbers on the interactive figures in the linkpost). And from the neuron patching figure we see that patching just Neuron 892 recovers 50% of the clean prompt performance, so actually the rest of the layer is entirely unhelpful.
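For readers who want to reproduce this kind of experiment, here is a rough sketch of single-neuron activation patching using TransformerLens-style hooks; the hook name and helper methods are assumptions on my part, and the prompts are placeholders rather than the ones used in the post:

```python
# Rough sketch of single-neuron activation patching (assumes TransformerLens;
# prompts are placeholders, not the ones used in the post).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER, NEURON = 31, 892  # the "an" neuron discussed in the post

# Placeholder prompts: they should tokenize to the same length, with the clean
# prompt expecting " an" and the corrupted prompt expecting " a".
clean_tokens = model.to_tokens("<clean prompt where ' an' is the correct next token>")
corrupt_tokens = model.to_tokens("<corrupted prompt where ' a' is the correct next token>")

a_id, an_id = model.to_single_token(" a"), model.to_single_token(" an")

def logit_diff(logits):
    # " an" minus " a" logit at the final position
    return (logits[0, -1, an_id] - logits[0, -1, a_id]).item()

# Cache the clean run, then overwrite one neuron's activation in the corrupted run.
_, clean_cache = model.run_with_cache(clean_tokens)
hook_name = f"blocks.{LAYER}.mlp.hook_post"

def patch_neuron(value, hook):
    # value: [batch, pos, d_mlp]; copy in the clean activation of a single neuron
    value[:, :, NEURON] = clean_cache[hook_name][:, :, NEURON]
    return value

patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_neuron)])
print(logit_diff(patched_logits))
```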
The next question might be: “Why does patching MLP 31 only recover 49% of the performance when the logit lens makes it look like it’s doing all the work?” I’m not sure what the answer to this is, but I also don’t think it’s particularly surprising. It may be that, when running on the corrupted prompt, MLP 33 adds a bunch to the " a" logit via the residual stream, and patching MLP 31 doesn’t change that very much.
What’s going on with the negative logit lens layers?
I think this just means that for the first 30 layers the model moves towards " a" being a better guess than " an". I expect a lot of the computation in these layers is working out that an indefinite article is required, of which " a" is more likely a priori. Only at layer 31 does it realize that " an" is actually more appropriate in this situation than " a".
Is there a reason you focus on output congruence as opposed to cosine similarity (which is just normalized congruence)? Intuitively, it seems like the scale of the output vector of an MLP neuron can be relatively arbitrary (only constrained to some extent by weight decay), since you can always scale the input instead. Do you expect the results to be different if you used that metric instead?
It seems to me like using cosine similarity could give different and misleading results. Imagine if the " an" direction pointed in exactly the same direction (cosine similarity = 1) as the output vectors of two neurons. If one of the two neurons has a magnitude 100× bigger than the other, then it will have 100× more impact on the " an" logit.
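A toy numerical version of this point (the dimensions and vectors are made up for illustration):

```python
# Two neuron output vectors with identical direction (cosine similarity 1 with the
# " an" unembedding direction) but norms differing by 100x: their dot products with
# that direction, and hence their effect on the " an" logit, differ by 100x.
import numpy as np

rng = np.random.default_rng(0)
an_unembed = rng.standard_normal(1280)   # stand-in for the " an" unembedding vector
small_neuron = 0.01 * an_unembed         # same direction, small norm
big_neuron = 1.0 * an_unembed            # same direction, 100x the norm

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(small_neuron, an_unembed), cosine(big_neuron, an_unembed))  # both 1.0
print(small_neuron @ an_unembed, big_neuron @ an_unembed)                # differ by 100x
```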
I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
Thanks for clarifying the scales!
I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
I might be misremembering the GPT2 architecture, but I thought the output of the MLP layer was something like $W_{out} \cdot \mathrm{ReLU}(W_{in} \cdot \mathrm{LN}(x_{\mathrm{residual}}))$? So you can just scale $W_{in}$ up when you scale $W_{out}$ down. (Assuming I’m remembering the architecture correctly,) if you’re concerned about the scale of the output, then I think it makes sense to look at the scale of $W_{in}$ as well.
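A quick numerical check of this rescaling argument (with a ReLU MLP and biases omitted, as in the comment above):

```python
# With ReLU, scaling W_in by c > 0 and W_out by 1/c leaves the MLP output unchanged,
# because ReLU(c * z) = c * ReLU(z) for c > 0. Shapes are arbitrary and biases omitted.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32
W_in = rng.standard_normal((d_mlp, d_model))
W_out = rng.standard_normal((d_model, d_mlp))
x = rng.standard_normal(d_model)
relu = lambda z: np.maximum(z, 0)

c = 10.0
original = W_out @ relu(W_in @ x)
rescaled = (W_out / c) @ relu((c * W_in) @ x)
print(np.allclose(original, rescaled))  # True
```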
We took the dot product over cosine similarity because the dot product is the neuron’s direct effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
I think your point about looking at the scale of $W_{in}$ if we are concerned about the scale of $W_{out}$ is fair. We didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence multiplied by the average of the squared input weights) could give a better representation of a neuron’s relevance for a token.
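As a sketch of what that hypothetical score might look like (the names and shapes are illustrative, not the post's code):

```python
# Hypothetical "input-scaled congruence": the usual congruence (dot product of a
# neuron's output weights with a token's unembedding vector), weighted by the mean
# squared input weight of that neuron.
import numpy as np

def input_scaled_congruence(w_out_neuron, w_in_neuron, token_unembed):
    congruence = w_out_neuron @ token_unembed   # neuron's direct effect on that token's logit
    input_scale = np.mean(w_in_neuron ** 2)     # crude proxy for how strongly the neuron is driven
    return congruence * input_scale
```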
I do agree that looking at $W_{out}$ alone seems a bit misguided (unless we’re normalizing by looking at cosine similarity instead of dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:
At first blush, the thing you said is exactly right; scaling $W_{in}$ up and scaling $W_{out}$ down will leave the implemented function unchanged.
However, this’ll affect the L2 regularization penalty. All else equal, we’d expect to see $\|W_{in}\| = \|W_{out}\|$, since that minimizes the regularization penalty.
However, this is all complicated by the fact that you can alternatively scale the LayerNorm’s gain parameter, which (I think) isn’t regularized.
Lastly, I believe GPT2 uses GELU, not ReLU? This is significant, since it no longer allows you to rescale $W_{in}$ and $W_{out}$ without changing the implemented function.
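A quick check of this last point, using the tanh approximation of GELU that GPT2 uses:

```python
# Unlike ReLU, GELU is not positively homogeneous: GELU(c * x) != c * GELU(x) for c > 0,
# so scaling W_in up and W_out down does change the implemented function.
import numpy as np

def gelu(z):
    # tanh approximation of GELU (the "gelu_new" variant used in GPT-2)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

x = np.array([-1.0, -0.1, 0.1, 1.0])
c = 10.0
print(np.allclose(gelu(c * x), c * gelu(x)))  # False
```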