Twitter: _clementneo
Site: clementneo.com
Clement Neo
Analysing Adversarial Attacks with Linear Probing
Sparse autoencoders find composed features in small toy models
Multi-Agent Security Hackathon
We used the dot product rather than cosine similarity because the dot product is the neuron's direct effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
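For concreteness, here is a minimal sketch of that congruence calculation with TransformerLens; the layer and neuron indices are placeholders rather than the ones from our experiments.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 10, 123              # placeholder indices, for illustration only

w_out = model.W_out[LAYER, NEURON]   # (d_model,) the neuron's output weights
W_U = model.W_U                      # (d_model, d_vocab) unembedding matrix

# Congruence of the neuron with each token: the dot product of the neuron's
# output direction with each token's unembedding vector. Up to LayerNorm, this
# is the neuron's direct contribution to each token's logit per unit of
# activation, which is why we prefer it over cosine similarity here.
congruence = w_out @ W_U             # (d_vocab,)

top = congruence.topk(10).indices
print([model.tokenizer.decode([int(t)]) for t in top])
```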
I think your point about taking the scale of the input weights into account is fair. We didn't really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence multiplied by the average of the squared input weights) could give us a better representation of a neuron's relevance for a token.
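A minimal sketch of what such an input-scaled score might look like (the specific weighting is an assumption, not something we implemented, and the indices are again placeholders):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 10, 123                        # placeholder indices

w_in = model.W_in[LAYER, :, NEURON]            # (d_model,) neuron's input weights
w_out = model.W_out[LAYER, NEURON]             # (d_model,) neuron's output weights
output_congruence = w_out @ model.W_U          # (d_vocab,) congruence per token

# Scale by the mean squared input weight, as a crude proxy for how strongly
# the rest of the network can drive this neuron through its inputs.
input_scaled_congruence = output_congruence * (w_in ** 2).mean()
```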
The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (' an') would be the top answer for the pre-patched prompt (the one with 'apple'), and the other token (' a') would be the top answer for the patched prompt (the one with 'lemon'). The idea is that because we know the top prediction is either ' an' or ' a', we can measure the effect of each individual part of the model by seeing how much patching that part sways the prediction towards the ' a' token.
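As a rough sketch of that kind of single-neuron patching with TransformerLens (the prompts, layer, and neuron index below are illustrative placeholders, not the exact ones from our experiment):

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 10, 123                       # placeholder indices

clean_prompt = "I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked"
corrupt_prompt = "I climbed up the pear tree and picked a pear. I climbed up the lemon tree and picked"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
assert clean_tokens.shape == corrupt_tokens.shape  # prompts must align token-for-token

an_id = model.to_single_token(" an")
a_id = model.to_single_token(" a")

# Cache all activations from the corrupted ('lemon') run.
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

hook_name = utils.get_act_name("post", LAYER)  # MLP activations after the nonlinearity

def patch_neuron(act, hook):
    # Overwrite one neuron's activation with its value from the corrupted run.
    act[:, :, NEURON] = corrupt_cache[hook.name][:, :, NEURON]
    return act

clean_logits = model(clean_tokens)
patched_logits = model.run_with_hooks(clean_tokens, fwd_hooks=[(hook_name, patch_neuron)])

def logit_diff(logits):
    # How much the final-position prediction favours ' an' over ' a'.
    return (logits[0, -1, an_id] - logits[0, -1, a_id]).item()

print("clean:", logit_diff(clean_logits), "patched:", logit_diff(patched_logits))
```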
To be clear, this can only tell us the significance of this neuron for this particular prompt, which is why we also tried to look at the neuron's behaviour from other angles: its activations over a larger, more diverse dataset, and its output weights.
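A rough sketch of the dataset-scan side of that, assuming a small OpenWebText sample (`stas/openwebtext-10k`) and the same placeholder indices; this is illustrative rather than our exact pipeline.

```python
from datasets import load_dataset
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 10, 123                       # placeholder indices
hook_name = utils.get_act_name("post", LAYER)

dataset = load_dataset("stas/openwebtext-10k", split="train")
top_examples = []

for text in dataset["text"][:200]:
    tokens = model.to_tokens(text)[:, :128]   # truncate for speed
    _, cache = model.run_with_cache(tokens)
    acts = cache[hook_name][0, :, NEURON]     # neuron activation at each position
    max_act, pos = acts.max(dim=0)
    pos = int(pos)
    snippet = model.to_string(tokens[0, max(0, pos - 10):pos + 1])
    top_examples.append((max_act.item(), snippet))

# Inspect the contexts where the neuron fires most strongly.
top_examples.sort(reverse=True)
print(top_examples[:10])
```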