This seems all correct to me except possibly this:
> So, artificially increasing W_in’s neurons to eg 100 should cause the same token to be predicted regardless of the prompt
W_in holds the input weights for each neuron, so you could increase the activation of the " an" neuron by multiplying that neuron's input weights by 100 (i.e. `W_in.T[892] *= 100`).
And if you increase the " an" neuron’s activation, you will increase the " an" logit. Our data suggests that if the activation is >10, " an" will almost always be the top prediction.
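As a minimal sketch in plain PyTorch (toy shapes, random weights, and the neuron index 892 from above, purely illustrative, not the actual GPT-2 weights): scaling one column of W_in scales that neuron's pre-activation, and hence (through the roughly linear part of the nonlinearity) its contribution to the " an" logit via W_out.

```python
import torch

# Toy, GPT-2-like shapes; weights and residual-stream vector are random for illustration.
d_model, d_mlp = 768, 3072
torch.manual_seed(0)

W_in = torch.randn(d_model, d_mlp) / d_model**0.5  # one column of input weights per neuron
x = torch.randn(d_model)                           # residual-stream activation at some position

neuron = 892                                       # the " an" neuron's index, from the discussion above
pre_act_before = x @ W_in[:, neuron]               # this neuron's pre-activation

# Scale the neuron's input weights by 100 (equivalent to W_in.T[892] *= 100):
W_in[:, neuron] *= 100
pre_act_after = x @ W_in[:, neuron]

# The pre-activation scales linearly, so the neuron's activation (and its write to the
# " an" direction in the residual stream) is amplified by roughly the same factor.
print(pre_act_after / pre_act_before)              # ~100
```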
> If the neuron activation is relatively very high, then this swamps the direction of your activations
I think this is true but not necessarily relevant. On the one hand, this neuron’s activation will increase the logit of " an" regardless of what the other activations are. On the other hand, if the other activations are high, they may reduce the probability of " an", either by increasing other logits directly or by activating neurons in later layers that write the opposite of the " an" direction to the residual stream.
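To illustrate that second point with made-up numbers: the amplified neuron adds the same fixed amount to the " an" logit either way, but the resulting probability (and whether " an" is the top prediction) still depends on the competing logits through the softmax. This is just a sketch; the token index, boost size, and competing logit are hypothetical.

```python
import torch

vocab_size = 50_257   # GPT-2-ish vocab size, for illustration
an_token = 0          # pretend " an" is token 0
boost = 15.0          # fixed logit contribution from the amplified neuron

def an_prob_and_rank(other_logits):
    logits = other_logits.clone()
    logits[an_token] += boost
    probs = torch.softmax(logits, dim=0)
    rank = (logits > logits[an_token]).sum().item()  # 0 means top prediction
    return probs[an_token].item(), rank

# Quiet prompt: no other logit competes, so " an" dominates (prob ~1, rank 0).
print(an_prob_and_rank(torch.zeros(vocab_size)))

# Strongly primed prompt: some other token's logit is higher, so the same boost
# no longer makes " an" the top prediction and its probability drops sharply.
strong = torch.zeros(vocab_size)
strong[123] = 20.0
print(an_prob_and_rank(strong))
```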