We chose the dot product over cosine similarity because the dot product is the neuron’s direct effect on the logits (unembedding takes the dot product of the residual stream with the embedding matrix).
I think your point about using the scale of W_in, if we are concerned about the scale of W_out, is fair. We didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence * average of squared input weights) could give us a better representation of a neuron’s relevance for a token.
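To make the comparison concrete, here is a minimal sketch of the three scores discussed above, using NumPy and toy vectors. The function names and the toy weights are hypothetical and only for illustration; this is not the code we ran, and the input-scaled score is just one possible reading of "output congruence * average of squared input weights".

```python
import numpy as np

def output_congruence(w_out, token_embedding):
    # Dot product of the neuron's output weights with the token's embedding:
    # the neuron's direct contribution to that token's logit per unit activation.
    return float(np.dot(w_out, token_embedding))

def cosine_congruence(w_out, token_embedding):
    # Same direction comparison, but with the scale of both vectors divided out.
    denom = np.linalg.norm(w_out) * np.linalg.norm(token_embedding)
    return float(np.dot(w_out, token_embedding) / denom)

def input_scaled_congruence(w_in, w_out, token_embedding):
    # Hypothetical variant: output congruence times the mean squared input weight.
    return output_congruence(w_out, token_embedding) * float(np.mean(w_in ** 2))

# Toy example (made-up numbers, residual stream of width 4).
w_in = np.array([2.0, 0.0, 0.0, 2.0])
w_out = np.array([1.0, 2.0, 0.0, -1.0])
token_embedding = np.array([0.5, 0.5, 1.0, 0.0])

dot_score = output_congruence(w_out, token_embedding)
cos_score = cosine_congruence(w_out, token_embedding)
scaled_score = input_scaled_congruence(w_in, w_out, token_embedding)

# Scaling W_out changes the dot product (and the logit effect) but not the cosine.
dot_score_x10 = output_congruence(10 * w_out, token_embedding)
cos_score_x10 = cosine_congruence(10 * w_out, token_embedding)
```

In this toy case, multiplying the output weights by 10 multiplies the dot-product score by 10 while leaving the cosine score unchanged, which is the scale sensitivity the dot product is meant to capture.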