[2104.07143v1] An Interpretability Illusion for BERT (arxiv.org) suggests a more complicated picture, in which many neurons appear to encode a coherent concept on one dataset but then seem to encode a completely different concept when tested on another. The two papers are not directly contradictory, but Figure 2 of the illusion paper suggests the opposite of what Figure 5 of the Knowledge Neurons paper suggests. On the other hand, the illusion paper reports tentative evidence for the existence of global concept directions, and perhaps all knowledge neurons are such global concept directions.
Ordered from most to least plausible, possible explanations for this apparent discrepancy include:
Knowledge neurons are more specialized than the average neuron (knowledge neurons are ‘global’)
Dataset choice matters. In particular, ParaRel sentences isolate relations in a way that other datasets don’t, which may help identify specialized neurons
Attribution method matters
Layer choice matters (the illusion paper mentions that quick looks at layers 2 and 7 showed similar results; the Knowledge Neurons paper motivates its layer choice by analogy to key-value pairs)