Clement Neo

Karma: 177

Twitter: _clementneo
Site: clementneo.com

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo and Jason Hoelscher-Obermaier

17 Jun 2024 14:16 UTC

9 points

0 comments · 8 min read · LW link

Sparse autoencoders find composed features in small toy models

Evan Anders, Clement Neo, Jason Hoelscher-Obermaier and Jessica N. Howard

14 Mar 2024 18:00 UTC

27 points

12 comments · 15 min read · LW link

Multi-Agent Security Hackathon

Esben Kran, Jason Hoelscher-Obermaier and Clement Neo

5 Feb 2024 22:51 UTC

6 points

0 comments · 1 min read · LW link

Clement Neo 13 Feb 2023 17:54 UTC
2 points
0
in reply to: scasper’s comment on: We Found An Neuron in GPT-2
The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (‘ an’) would be the top answer for the pre-patched prompt (the one with ‘apple’), and the other token (‘ a’) would be the top answer for the patched prompt (the one with ‘lemon’). The idea with these prompts is that we know the top prediction is either ‘ an’ or ‘ a’, so we can measure the effect of each individual part of the model by seeing how much patching that part sways the prediction towards the ‘ a’ token.

To be clear, this can only tell us the significance of this neuron in this particular prompt, which is why we also tried to look at the behaviour of this neuron from other perspectives: its activation over a larger, diverse dataset, and its output weights.
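
As a rough illustration of the patching comparison described above, here is a minimal sketch on a toy one-hidden-layer model rather than GPT-2. Everything in it (the `forward` and `patching_effects` helpers, the weights, and the two-token vocabulary) is a hypothetical stand-in chosen for illustration, not the code used in the post. It runs the ‘apple’-style input, swaps in one neuron's activation from the ‘lemon’-style input at a time, and records how far the logit difference moves towards ‘ a’.

```python
import numpy as np

def forward(x, W_in, W_out, patch=None):
    """Toy one-hidden-layer model (an assumption, not GPT-2).

    `patch` optionally maps a neuron index to an activation value that
    overrides what the model computed for that neuron.
    """
    hidden = np.maximum(x @ W_in, 0.0)  # ReLU activations
    if patch is not None:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return hidden, hidden @ W_out  # logits ordered as [' an', ' a']

def patching_effects(x_clean, x_patch, W_in, W_out):
    """Per-neuron shift in logit(' a') - logit(' an') caused by patching."""
    hidden_patch, _ = forward(x_patch, W_in, W_out)
    _, base_logits = forward(x_clean, W_in, W_out)
    base_diff = base_logits[1] - base_logits[0]
    effects = np.zeros(W_in.shape[1])
    for i in range(W_in.shape[1]):
        # Patch only neuron i with its activation from the other prompt.
        _, logits = forward(x_clean, W_in, W_out, patch={i: hidden_patch[i]})
        effects[i] = (logits[1] - logits[0]) - base_diff
    return effects

# Hypothetical inputs standing in for the 'apple' and 'lemon' prompts.
x_clean = np.array([1.0, 0.0])
x_patch = np.array([0.0, 1.0])
W_in = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0]])
W_out = np.array([[2.0, 0.0],
                  [0.0, 3.0],
                  [1.0, 1.0]])

effects = patching_effects(x_clean, x_patch, W_in, W_out)
```

In this toy setup the third neuron fires identically on both inputs, so patching it changes nothing, which mirrors the caveat that a patching result only speaks to the specific pair of prompts used.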

Clement Neo 12 Feb 2023 10:57 UTC
6 points
1
in reply to: LawrenceC’s comment on: We Found An Neuron in GPT-2
We chose the dot product over cosine similarity because the dot product is the neuron’s effect on the logits (since we use the dot product of the residual stream and embedding matrix when unembedding).
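
A small sketch of why the two measures can disagree, assuming a toy unembedding matrix `W_U` of shape (d_model, vocab) and made-up output weight vectors (none of these values come from GPT-2). Scaling a neuron's output weights leaves the cosine similarity with a token direction unchanged, while the logits it writes scale with it.

```python
import numpy as np

def logit_contribution(activation, w_out, W_U):
    """Logits added by one neuron: activation * (w_out @ W_U)."""
    return activation * (w_out @ W_U)

def cosine_similarity(a, b):
    """Direction-only similarity; discards the magnitude of both vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy values: d_model = 3, vocabulary of 2 tokens.
W_U = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
w_small = np.array([1.0, 0.0, 1.0])
w_large = 10.0 * w_small          # same direction, ten times the norm
token_direction = W_U[:, 0]       # unembedding column for token 0

cos_small = cosine_similarity(w_small, token_direction)
cos_large = cosine_similarity(w_large, token_direction)
logits_small = logit_contribution(1.0, w_small, W_U)
logits_large = logit_contribution(1.0, w_large, W_U)
```

Here `cos_small` and `cos_large` are equal, but `logits_large` is ten times `logits_small`, which is the sense in which the dot product tracks the effect on the logits.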

I think your point on using the scale of $W_{in}$ if we are concerned about the scale of $W_{out}$ is fair. We didn’t really look at how the rest of the network interacted with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence * average of squared input weights) could give us a better representation of a neuron’s relevance for a token.
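
One possible reading of that input-scaled score, as a sketch only: the function names and the toy vectors below are hypothetical, and "average of squared input weights" is taken literally as the mean of the squared entries of the neuron's input weight vector.

```python
import numpy as np

def output_congruence(w_out, token_unembed):
    """Dot product of a neuron's output weights with a token's unembedding vector."""
    return float(w_out @ token_unembed)

def input_scaled_congruence(w_in, w_out, token_unembed):
    """Output congruence multiplied by the mean squared input weight."""
    return output_congruence(w_out, token_unembed) * float(np.mean(w_in ** 2))

# Hypothetical toy weights for a single neuron.
w_in = np.array([1.0, -1.0, 2.0, 0.0])
w_out = np.array([2.0, 0.0, 1.0])
token_unembed = np.array([1.0, 0.0, 1.0])

plain_score = output_congruence(w_out, token_unembed)
scaled_score = input_scaled_congruence(w_in, w_out, token_unembed)
```

Under this reading, a neuron with large output weights but very small input weights would score lower than its plain congruence suggests.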

We Found An Neuron in GPT-2

Joseph Miller and Clement Neo

11 Feb 2023 18:27 UTC

143 points

23 comments · 7 min read · LW link

(clementneo.com)