Aryaman Arora comments on A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Aryaman Arora 8 Nov 2022 20:20 UTC
LW: 3 AF: 1
I’m pretty sure! I don’t think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. $\operatorname{softmax}(W_E W_U) = \operatorname{softmax}(W_E W_E^T)$) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude, since in the dot product $x \cdot y = \|x\| \|y\| \cos(\theta)$ the cosine similarity term is effectively useless (it is high for every pair).
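
A minimal sketch of the effect described above, using synthetic random embeddings rather than real GPT-2 or GPT-Neo weights. The names `W_E`, `top_token`, and the offset size are assumptions made for illustration only. It compares the argmax of $W_E W_E^T$ for roughly centered embeddings against the same embeddings with a large constant vector added to every row.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 200, 128

# Synthetic stand-in for an embedding matrix (NOT real model weights).
W_E = rng.normal(size=(n_vocab, d_model))


def top_token(W):
    """Row-wise argmax of W @ W.T.

    Softmax is monotonic, so it leaves the argmax unchanged and is skipped.
    """
    return (W @ W.T).argmax(axis=1)


# Case 1: roughly centered embeddings. Each token's best match is itself.
centered_top = top_token(W_E)
frac_self_centered = (centered_top == np.arange(n_vocab)).mean()

# Case 2: the same embeddings plus a large constant offset (hypothetical size).
offset = 1000.0 * np.ones(d_model)
W_shifted = W_E + offset
shifted_top = top_token(W_shifted)

# How many rows agree on a single "winning" token, and which token has the
# largest norm after the shift.
counts = np.bincount(shifted_top, minlength=n_vocab)
frac_most_common = counts.max() / n_vocab
largest_norm_token = np.linalg.norm(W_shifted, axis=1).argmax()

print("self-match fraction (centered):", frac_self_centered)
print("fraction of rows sharing one top token (shifted):", frac_most_common)
print("most common top token:", counts.argmax(),
      "| largest-norm token:", largest_norm_token)
```

With the offset, every pairwise cosine is close to 1, so the ranking within a row is driven almost entirely by $\|y\|$, and nearly every row picks the same high-norm token.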

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
- Neel Nanda 8 Nov 2022 22:13 UTC
  LW: 3 AF: 1
  See my other comment—it turns out to be the boring fact that there’s a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)
  
  > What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
  
  I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn’t used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn’t. And generally the kind of tasks that I’d expect to depend on tokens depend substantially on MLP0.
  - Aryaman Arora 9 Nov 2022 0:32 UTC
    1 point
    Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo’s later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.
    
    Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?
    - Neel Nanda 9 Nov 2022 13:34 UTC
      LW: 2 AF: 1
      Haven’t checked lol
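
The mean-subtraction check mentioned in the thread (subtract the mean of the embedding and see whether it "looks normal") can be sketched roughly as follows. This again uses synthetic data, not GPT-Neo's actual embedding matrix; `shared_offset`, `mean_pairwise_cosine`, and the sizes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 200, 128

# Synthetic embeddings with a large shared offset (assumed, not measured).
base = rng.normal(size=(n_vocab, d_model))
shared_offset = 1000.0 * np.ones(d_model)
W_E = base + shared_offset


def mean_pairwise_cosine(W):
    """Average cosine similarity over all pairs of distinct rows."""
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(len(W), dtype=bool)]
    return off_diagonal.mean()


# Subtract the mean embedding (the mean over the vocabulary axis).
W_centered = W_E - W_E.mean(axis=0, keepdims=True)

cos_before = mean_pairwise_cosine(W_E)
cos_after = mean_pairwise_cosine(W_centered)

# After centering, check whether each token's top match is itself again.
self_match_after = (
    (W_centered @ W_centered.T).argmax(axis=1) == np.arange(n_vocab)
).mean()

print("mean pairwise cosine before centering:", cos_before)
print("mean pairwise cosine after centering:", cos_after)
print("self-match fraction after centering:", self_match_after)
```

On real weights the same check would swap the synthetic `W_E` for the model's embedding matrix; whether the result is as clean as in this toy setup is something to verify, not assume.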
