How was the ' a' vs. ' an' task selected? It seems quite convenient to probe for, and also the kind of thing that could result from p-hacking over a set of similar simple tasks.
The prompt was in a style similar to the [Interpretability In The Wild](https://arxiv.org/abs/2211.00593) paper, where one token (' an') would be the top answer for the pre-patched prompt (the one with 'apple'), and the other token (' a') would be the top answer for the patched prompt (the one with 'lemon'). The idea is that with these prompts we know the top prediction is either ' an' or ' a', so we can measure the effect of each individual part of the model by seeing how much patching that part sways the prediction towards the ' a' token.
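For concreteness, here is a minimal activation-patching sketch of that setup using the TransformerLens library. The prompts and the choice to patch each layer's MLP activations are illustrative assumptions, not necessarily the exact experiment from the post:

```python
# Minimal activation-patching sketch (TransformerLens). The prompts and the
# decision to patch MLP activations layer-by-layer are illustrative, not the
# exact setup from the post.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")

# The clean prompt should complete with " an", the corrupted one with " a".
clean_prompt = "I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked"
corrupt_prompt = "I climbed up the pear tree and picked a pear. I climbed up the lemon tree and picked"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
a_token = model.to_single_token(" a")
an_token = model.to_single_token(" an")

# Cache every activation from the corrupted (" a") run.
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def logit_diff(logits):
    # Positive: the model prefers " an" over " a" at the final position.
    return (logits[0, -1, an_token] - logits[0, -1, a_token]).item()

def patch_final_pos(value, hook):
    # Overwrite this hook's activations at the final position with the
    # corrupted run's activations, leaving everything else intact.
    value[:, -1, :] = corrupt_cache[hook.name][:, -1, :]
    return value

print("clean logit diff:", logit_diff(model(clean_tokens)))

# Patch each layer's MLP neuron activations in turn; a layer whose patch
# swings the logit diff towards " a" matters for this prediction.
for layer in range(model.cfg.n_layers):
    patched_logits = model.run_with_hooks(
        clean_tokens,
        fwd_hooks=[(f"blocks.{layer}.mlp.hook_post", patch_final_pos)],
    )
    print(layer, logit_diff(patched_logits))
```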
To be clear, this can only tell us the significance of this neuron in this particular prompt, which is why we also looked at the behaviour of this neuron from other perspectives: its activation over a larger, diverse dataset, and its output weights.
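As a rough sketch of the output-weight check (again assuming TransformerLens conventions; the layer and neuron indices below are placeholders, not the ones from the post):

```python
# Rough sketch: which tokens does a single MLP neuron boost through the
# unembedding? LAYER/NEURON are placeholders, not the indices from the post,
# and this ignores the final LayerNorm for simplicity.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER, NEURON = 31, 0  # placeholder indices

# W_out[layer, neuron] is the neuron's output direction in the residual
# stream; projecting through W_U gives its direct effect on each token logit.
neuron_logits = model.W_out[LAYER, NEURON] @ model.W_U  # shape: [d_vocab]
top = neuron_logits.topk(10)
print(model.to_str_tokens(top.indices))
```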
Thanks, but I’m asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study “this” versus “that” completions or any number of other simple things in the language model?
I don't think there was much reason for choosing " a" vs. " an" to study over something else. This was the first thing we investigated and we were excited to see a single neuron mechanism, so we kept going. Bear in mind this project originated in a 48 hour hackathon :)