CharlotteS comments on A Comprehensive Mechanistic Interpretability Explainer & Glossary

CharlotteS 15 Jan 2023 22:34 UTC
1 point
0
Thanks for writing this. A question:
Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input.
Shouldn’t it be “each feature corresponds to a neuron” rather than “each neuron corresponds to a feature”?
Because some could be just calculations to get to a higher-level features (part of a circuit).
- Neel Nanda 16 Jan 2023 16:26 UTC
  2 points
  0
  Parent
  Fair point, corrected.
  
  Because some could be just calculations to get to a higher-level features (part of a circuit).
  
  IMO, the intermediate steps should mostly be counted as features in their own right, but it’d depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, eg memory management or signal boosting earlier directions in the residual stream.