Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.
We did look very briefly at this for the " an" neuron. We plotted the congruence of the residual stream with the neuron's input weights throughout the model; the second figure shows the per-layer difference. Unfortunately I can't seem to include an image in a comment. See it here.
We can’t tell that much from this but I think there are three takeaways:
The model doesn’t start ‘preparing’ to activate the " an" neuron until layer 16.
No single layer stands out as particularly responsible for the " an" neuron's activation (which is part of why we didn't investigate this further).
The congruence increases a lot after MLP 31. This means the output of layer 31 is very congruent with the input weights of the " an" neuron (which is in MLP 31). I think this is almost entirely the effect of the " an" neuron itself, partly because the input of the " an" neuron is very congruent with the " an" token (although not as much as the neuron's output weights). This makes me think that this neuron is at least partly a 'signal boosting' neuron.
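For concreteness, here is a minimal sketch of the kind of per-layer congruence measurement described above, using cosine similarity between the residual stream and the neuron's input weight vector. All arrays are random stand-ins (toy sizes, not the real model or its weights), and the variable names are my own, not from the original analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 8  # toy dimensions, not the real model's

# Stand-in for the " an" neuron's input weight vector
w_in = rng.normal(size=d_model)
# Stand-in for the residual stream after each layer
resid = rng.normal(size=(n_layers, d_model))

def congruence(x, w):
    """Cosine similarity between a residual stream vector and the neuron's input weights."""
    return float(x @ w / (np.linalg.norm(x) * np.linalg.norm(w)))

# Congruence of the residual stream with the neuron direction after each layer
per_layer = [congruence(resid[layer], w_in) for layer in range(n_layers)]

# The per-layer difference shows how much each layer moves the residual
# stream toward (or away from) the neuron-input direction
diffs = np.diff(per_layer)
```

A layer that writes strongly in the neuron-input direction would show up as a large positive entry in `diffs`; the takeaways above correspond to looking for where these differences first become nonzero and which layers (if any) dominate.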