Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input.
Shouldn’t it be “each feature corresponds to a neuron” rather than “each neuron corresponds to a feature”?
Because some could be just calculations to get to a higher-level features (part of a circuit).
Because some could be just calculations to get to a higher-level features (part of a circuit).
IMO, the intermediate steps should mostly be counted as features in their own right, but it’d depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, eg memory management or signal boosting earlier directions in the residual stream.
Thanks for writing this. A question:
Shouldn’t it be “each feature corresponds to a neuron” rather than “each neuron corresponds to a feature”?
Because some could be just calculations to get to a higher-level features (part of a circuit).
Fair point, corrected.
IMO, the intermediate steps should mostly be counted as features in their own right, but it’d depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, eg memory management or signal boosting earlier directions in the residual stream.