This might be a good time for me to ask a basic question on mechanistic interpretability:
Why does targeting single neurons work? Does it work? One would think that if there is a single dimensional quantity to measure, why would it align with the standard basis? Why wouldn’t it be aligned to a random one dimensional linear subspace? Then, examining single neurons is likely to give you some weighted combination of concepts instead, rather than a single interpretation...
It’s not a full answer, but:
To the degree that it is true that the quantities align with the standard basis, it must be somehow a result of asymmetry of the activation. For example ReLU trivially depend on the choice of basis.
If you focus on the ReLU example, it sort of make sense: if multiple non-related concepts express in the same neuron, and one of them push the neuron in the negative direction, it may make the ReLU destroy information of the other concepts.
This might be a good time for me to ask a basic question on mechanistic interpretability:
Why does targeting single neurons work? Does it work? One would think that if there is a single dimensional quantity to measure, why would it align with the standard basis? Why wouldn’t it be aligned to a random one dimensional linear subspace? Then, examining single neurons is likely to give you some weighted combination of concepts instead, rather than a single interpretation...
Those are good questions! There’s some existing research which address some of your questions.
Single neurons often do represent multiple concepts: https://transformer-circuits.pub/2022/toy_model/index.html
It seems to still be unclear why the dimensions are aligned with the standard basis: https://transformer-circuits.pub/2023/privileged-basis/index.html
It’s not a full answer, but: To the degree that it is true that the quantities align with the standard basis, it must be somehow a result of asymmetry of the activation. For example ReLU trivially depend on the choice of basis.
If you focus on the ReLU example, it sort of make sense: if multiple non-related concepts express in the same neuron, and one of them push the neuron in the negative direction, it may make the ReLU destroy information of the other concepts.