evhub comments on interpreting GPT: the logit lens

evhub 1 Sep 2020 22:34 UTC
LW: 3 AF: 2

That’s a great idea!

Thanks! I’d be quite excited to know what you find if you end up trying it.

Hmm… I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they’re near-impossible so you don’t need to track them closely), and then smoothly subtracted out at the end. There’s probably a way to check if that’s happening.

I wasn’t thinking you would do this with the natural component basis (though it’s probably worth trying that too), but rather that you would do some sort of matrix decomposition on the embedding matrix to get a basis ordered by importance, and then see what the linear model looks like in that basis. PCA or NMF would both work: PCA is simpler, though I know NMF is what OpenAI Clarity usually uses when they’re trying to extract interpretable basis elements from neural network activations. You could even do something like what you’re suggesting and find a basis ordered by the frequency of the tokens that each basis element corresponds to (though I’m not sure exactly what the right way to generate such a basis would be).
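To make the PCA version of this concrete, here is a minimal sketch. It is not anyone's actual analysis code: `W_emb` and `A` are random toy stand-ins (my assumption) for the real embedding matrix and for whatever linear map is being inspected, and the sizes are shrunk from roughly 50K x 1600 so it runs instantly. It uses NumPy's SVD to get an orthonormal basis ordered by variance explained, then rewrites the linear map in that basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not real GPT weights): a random "embedding
# matrix" with unequal column scales, and a random linear map on embedding space.
n_vocab, n_emb = 500, 16
W_emb = rng.normal(size=(n_vocab, n_emb)) * np.linspace(3.0, 0.1, n_emb)
A = rng.normal(size=(n_emb, n_emb))

# PCA via SVD of the mean-centered embedding matrix.
centered = W_emb - W_emb.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Rows of Vt are orthonormal directions in embedding space, sorted by
# decreasing singular value, i.e. by variance explained across tokens.
explained = singular_values**2 / np.sum(singular_values**2)

# Change of basis: if c = Vt @ x are PCA coordinates of x, then
# y = A @ x has PCA coordinates (Vt @ A @ Vt.T) @ c.
A_pca = Vt @ A @ Vt.T

print("fraction of variance in top 5 directions:", np.round(explained[:5], 3))
```

With real weights, the thing to look at would be whether `A_pca` has visible structure (for example, near-diagonal or block structure in the leading directions) that the raw `A` does not.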
- nostalgebraist 2 Sep 2020 0:02 UTC
  LW: 5 AF: 3
  I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.
  What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they’re continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.
  I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won’t really illuminate this. I wish I knew more about this topic. Are there standard techniques?
  - Vlad Mikulik 2 Sep 2020 15:42 UTC
    LW: 5 AF: 3
    You might want to look into NMF, which, unlike PCA/SVD, doesn’t aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don’t think it will let you directly find the ‘larger set of almost orthogonal vectors’ you’re looking for.
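The ‘larger set of almost orthogonal vectors’ idea from this subthread is easy to sanity-check numerically. The sketch below only shows that many more than `dim` nearly orthogonal directions fit in a `dim`-dimensional space; it uses random unit vectors (an assumption made for illustration) and says nothing about which directions a trained model actually uses or how to recover them. The function name `cosine_stats` and the sizes are made up for the example.

```python
import numpy as np

def cosine_stats(n_vectors, dim, seed=0):
    """Return (max, mean) of |cosine| over all distinct pairs of random unit vectors."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(n_vectors, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize rows
    gram = vecs @ vecs.T                                  # all pairwise cosines
    off_diag = np.abs(gram[np.triu_indices(n_vectors, k=1)])
    return off_diag.max(), off_diag.mean()

# Four times as many vectors as dimensions, so they cannot all be exactly
# orthogonal; the question is how far from orthogonal they end up.
dim = 512
n_vectors = 4 * dim
max_cos, mean_cos = cosine_stats(n_vectors, dim)
print(f"{n_vectors} vectors in {dim} dims: "
      f"max |cos| = {max_cos:.3f}, mean |cos| = {mean_cos:.3f}")
```

For random directions the typical |cosine| shrinks like 1/sqrt(dim), which is why the overlap stays small even when the number of vectors far exceeds the dimension. Rerunning with a tiny `dim` (say 4) shows the contrast: some pairs end up nearly parallel.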