michael_mjd comments on New OpenAI Paper—Language models can explain neurons in language models

michael_mjd 11 May 2023 2:42 UTC
14 points
4
This might be a good time for me to ask a basic question on mechanistic interpretability:

Why does targeting single neurons work? Does it work? If there is a one-dimensional quantity to measure, why would it align with the standard basis rather than with a random one-dimensional linear subspace? If it does not align, examining a single neuron is likely to give you a weighted combination of concepts rather than a single interpretation...
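
To make the worry concrete, here is a minimal NumPy sketch. It assumes a toy setting that is not from the paper: four hypothetical concepts, each assigned to one direction of a random orthonormal basis in a four-neuron activation space. The names `Q`, `concepts`, and `neurons` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hypothetical number of neurons (and of concepts)

# Random orthonormal basis: column j is the direction assigned to concept j.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Only concept 0 is active.
concepts = np.array([1.0, 0.0, 0.0, 0.0])

# Activations as seen in the standard (per-neuron) basis.
neurons = Q @ concepts

# Concept 0 is spread across the neurons instead of living in one of them.
print(neurons)

# Reading along the concept directions (Q is orthogonal) recovers the concepts.
recovered = Q.T @ neurons
print(recovered)
```

In this toy setup no single neuron isolates concept 0, while projecting onto the right direction does. Whether real networks look like this is exactly the open question.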
- Adele Lopez 11 May 2023 2:56 UTC
  11 points
  3
  Those are good questions! There’s some existing research that addresses some of them.
  
  Single neurons often do represent multiple concepts: https://transformer-circuits.pub/2022/toy_model/index.html
  
  It still seems to be unclear why the dimensions are aligned with the standard basis: https://transformer-circuits.pub/2023/privileged-basis/index.html
- Ben Amitay 11 May 2023 4:50 UTC
  8 points
  2
  It’s not a full answer, but: to the degree that the quantities do align with the standard basis, it must somehow result from an asymmetry of the activation function. For example, ReLU trivially depends on the choice of basis.
  
  If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, the ReLU may destroy information about the other concepts.
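
Both ReLU points above can be checked on a toy example. This is only a sketch under assumed numbers: a 45-degree rotation `R`, and a hypothetical `neuron` function in which concept A pushes the pre-activation up and concept B pushes it down. None of these values come from a real model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# 1) ReLU depends on the basis: it does not commute with a rotation.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, -1.0])

rotate_then_relu = relu(R @ x)
relu_then_rotate = R @ relu(x)
commutes = np.allclose(rotate_then_relu, relu_then_rotate)
print(commutes)

# 2) Two concepts sharing one neuron (hypothetical weights +1 and -2).
def neuron(a, b):
    return relu(1.0 * a - 2.0 * b)

# With B active, different strengths of A are clipped to the same output.
with_b = [neuron(a, 1.0) for a in (0.5, 1.0)]
# With B inactive, the strength of A passes through unchanged.
without_b = [neuron(a, 0.0) for a in (0.5, 1.0)]
print(with_b, without_b)
```

When B is active, the two strengths of A become indistinguishable after the ReLU, which is the information loss described above.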