I’m inclined to be more skeptical of these results.
I agree that this paper demonstrates that it’s possible to interfere with a small number of neurons in order to mess up retrieval of a particular fact (roughly 6 out of the 40k MLP neurons, if I understand correctly), which definitely tells you something about what the model is doing.
But beyond that I think the inferences are dicier:
Knowledge neurons don’t seem to include all of the model’s knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%. This makes it sound like accuracy is still quite high for these relational facts even after removing the 6 knowledge neurons, and certainly the model still knows a lot about them. (It’s hard to tell because this is a weird way to report the deterioration of the model’s knowledge, and I actually don’t know exactly how they defined it.) And of course the correct operation of a model can be fragile, so even if knocking out 6 neurons did mess up knowledge about topic X, that doesn’t really show you’ve localized knowledge about X.
I don’t think there’s evidence that these knowledge neurons don’t do a bunch of other stuff. After removing about 0.02% of neurons they found that the mean probability on other correct answers decreased by 0.4%. They describe this as “almost unchanged” but it seems larger than I’d expect from knocking out random neurons in a model trained with dropout (if you extrapolate that to knocking out 10% of the MLP neurons, as done during training, you’d have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout). So I think this shows that the knowledge neurons are more important for the knowledge in question than for typical inputs, rather than e.g. the model being brittle so that knocking out 6 neurons messes up everything, but it definitely doesn’t tell us much about whether the knowledge neurons handle only this fact.
For groups of neurons, they instead report accuracy and find it falls by about half. Again, this does show that they’ve found 20 neurons that are particularly important for, e.g., facts about capitals, in the sense that removing them messes those facts up (though they clearly don’t contain all the info about capitals). I agree this is more impressive than the individual knowledge neurons, but man, there are lots of ways to get this kind of result even if the mechanical story of what the neuron is doing is 100% wrong.
Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token. I don’t know what the effect of that is (or why they made the change); you could easily imagine it kind of clobbering the model. I think I’d feel more comfortable if they had shown that the model still worked OK. But at this point they switch to a different way of measuring performance on other prompts (perplexity). It’s not clear whether they mean perplexity just on the masked token or on all the tokens, and I haven’t thought about these numbers in detail, but I think they might imply that the [UNK] replacement did quite a lot of damage.
I doubt that the attribution method matters much for identifying these neurons (and I don’t think any evidence is given that it does). I suspect anything reasonable would work fine. Moreover, I’m not sure it’s more impressive to get this result using this attribution method rather than by directly searching for neurons whose removal causes bad behavior (which would probably produce even better results). It seems like the attribution method can just act as a slightly noisy way of doing that.
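For concreteness, here is a minimal sketch of what that direct search could look like (this is not the paper’s method; correct_prob_with_ablation is a hypothetical helper that re-runs the model on the prompt with one neuron’s activation clamped to zero and returns the probability of the correct answer):

```python
# Sketch of the brute-force alternative: rank neurons by how much zeroing each
# one individually hurts the correct answer (hypothetical helper, not the paper's code).
from typing import Callable, List, Tuple

def rank_neurons_by_ablation(
    correct_prob_with_ablation: Callable[[int, int], float],
    baseline_prob: float,
    n_layers: int,
    n_neurons_per_layer: int,
    top_k: int = 6,
) -> List[Tuple[int, int, float]]:
    """Return the top_k (layer, neuron, drop) triples whose ablation hurts the answer most."""
    drops = []
    for layer in range(n_layers):
        for neuron in range(n_neurons_per_layer):
            p = correct_prob_with_ablation(layer, neuron)
            drops.append((layer, neuron, baseline_prob - p))
    drops.sort(key=lambda triple: triple[2], reverse=True)
    return drops[:top_k]
```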
When you say “In general, the feed-forward layers can be thought of...”, I agree that it’s possible to think of them that way, but I don’t see much evidence that this is a good way to think about how the feed-forward layers work, or even how these neurons work. A priori, if a network did work this way, it’s unclear why individual neurons would correspond to individual lookups rather than using a distributed representation (and they probably wouldn’t: dedicating one neuron per lookup is a sparse, crazy inefficient use of capacity, and if anything seems harder for SGD to learn), so I’m not sure that this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.
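For reference, here is a toy version of the key-value reading being discussed (random weights and made-up dimensions, nothing from the paper); under this reading, zeroing one neuron removes one (key, value) pair from the lookup, which is the picture the knowledge-neuron story leans on:

```python
# Toy illustration of the "feed-forward layer as key-value memory" reading
# (random weights, made-up dimensions; not the paper's model or code).
import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_ff, d_model))   # row i acts like the "key" of neuron i
W_out = rng.normal(size=(d_ff, d_model))  # row i acts like the "value" of neuron i

def ffn(x: np.ndarray) -> np.ndarray:
    a = np.maximum(W_in @ x, 0.0)  # each neuron's activation = how well its key matches x
    return a @ W_out               # output = activation-weighted sum of the values

x = rng.normal(size=d_model)
print(ffn(x).shape)  # (8,) -- one vector added back into the residual stream
```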
I think the evidence about replacement is kind of weak. In general, adding t - t' to the activations at any layer should increase the probability the model says t and decrease the probability it says t'. So adding it to the output weights of any activated neuron should tend to have this effect, if only in a super crude way (as well as probably messing up a bunch of other stuff). But they don’t give any evidence that the transformation had a reliable effect, or that it didn’t mess up other stuff, or that they couldn’t have a similar effect by targeting other neurons.
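To spell out the t - t' point, here is a toy check with a tied unembedding and random vectors (an illustration of the general claim, not the paper’s editing procedure): adding embed(t) - embed(t') to the output weights of a neuron that fires on the prompt raises the logit of t and lowers the logit of t', roughly in proportion to the activation, whatever else the neuron was doing.

```python
# Toy check that adding (t - t') to an activated neuron's output weights pushes
# logits toward t and away from t' (tied unembedding, random vectors; not the
# paper's procedure).
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d))  # toy token embedding matrix, also used to unembed
t, t_prime = 3, 7                # arbitrary "new" and "old" answer tokens

value = rng.normal(size=d)       # output weights ("value") of one neuron
activation = 1.5                 # the neuron fires on this prompt

logits_before = E @ (activation * value)
logits_after = E @ (activation * (value + (E[t] - E[t_prime])))

# The logit of t rises by roughly activation * ||E[t]||^2 and the logit of t'
# falls by roughly activation * ||E[t']||^2, regardless of what the neuron did before.
print(logits_after[t] - logits_before[t])              # positive
print(logits_after[t_prime] - logits_before[t_prime])  # negative
```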
Actually looking at the replacement stuff in detail it seems even weaker than that. Unless I’m missing something it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It’s possible that they just didn’t care about exploring this effect experimentally, but I’d guess that they tried some simple stuff and found the effect to be super brittle and so didn’t report it. And in the cases they report, they are changing the model from remembering an incorrect fact to a correct one—that seems important because probably the model put significant probability on the correct thing already.
When you say “the prompts that most activate the knowledge neuron all contain the relevant relational fact,” I think the more precise statement is “Amongst 20 random prompts that mention both Azerbaijan and Baku, the ones that most activate the knowledge neuron also include the word ‘capital’.” I think this is relatively unsurprising: if you select a neuron because it responds to various permutations of (Azerbaijan, Baku, capital), then it makes sense that removing one of the three words would decrease the activation of the neuron relative to having all three, regardless of whether the neuron is a knowledge neuron or anything else that happens to be activated on those prompts.
I think the fact that amplifying knowledge neurons helps accuracy doesn’t really help the story, and if anything slightly undercuts it. This is exactly what you’d expect if the neurons just happened to have a big gradient on those inputs. But if the model is actually doing knowledge retrieval, then it seems somewhat less likely that you could make things so much better just by increasing the magnitude of the values.
> Knowledge neurons don’t seem to include all of the model’s knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%.
Yeah, agreed—though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I’m not saying they have necessarily done that).
> I don’t think there’s evidence that these knowledge neurons don’t do a bunch of other stuff. After removing about 0.02% of neurons they found that the mean probability on other correct answers decreased by 0.4%. They describe this as “almost unchanged” but it seems larger than I’d expect from knocking out random neurons in a model trained with dropout (if you extrapolate that to knocking out 10% of the MLP neurons, as done during training, you’d have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout).
That’s a good point—I didn’t look super carefully at their number there, but I agree that looking more carefully it does seem rather large.
> Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token.
I also thought this was somewhat strange and am not sure what to make of it.
> A priori, if a network did work this way, it’s unclear why individual neurons would correspond to individual lookups rather than using a distributed representation (and they probably wouldn’t: dedicating one neuron per lookup is a sparse, crazy inefficient use of capacity, and if anything seems harder for SGD to learn), so I’m not sure that this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.
I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.
> But they don’t give any evidence that the transformation had a reliable effect, or that it didn’t mess up other stuff, or that they couldn’t have a similar effect by targeting other neurons.
> Actually looking at the replacement stuff in detail it seems even weaker than that. Unless I’m missing something it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It’s possible that they just didn’t care about exploring this effect experimentally, but I’d guess that they tried some simple stuff and found the effect to be super brittle and so didn’t report it. And in the cases they report, they are changing the model from remembering an incorrect fact to a correct one—that seems important because probably the model put significant probability on the correct thing already.
Perhaps I’m too trusting—I agree that everything you’re describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.