Knowledge neurons don’t seem to include all of the model’s knowledge about a given question. Cutting them out only decreases the probability assigned to the correct answer by 40%.
Yeah, agreed—though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I’m not saying they have necessarily done that).
I don’t think there’s evidence that these knowledge neurons don’t also do a bunch of other stuff. After removing about 0.02% of neurons, they found that the mean probability of other correct answers decreased by 0.4%. They describe this as “almost unchanged,” but it seems larger than I’d expect from knocking out random neurons in a model trained with dropout (if you extrapolate that to knocking out 10% of the MLP neurons, as is done during training, you’d have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout).
That’s a good point—I didn’t look super carefully at their number there, but I agree that looking more carefully it does seem rather large.
Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token.
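To make concrete why that might matter, here is a toy sketch (the shapes, neuron indices, and stand-in [UNK] vector are all made up, and this is just one reading of their intervention, not the paper’s actual procedure): replacing the neurons’ contribution with a fixed vector amounts to a zero-ablation plus an extra additive edit, so part of the measured change on other answers could come from the injected vector rather than from losing whatever the neurons were computing.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072                      # BERT-base-ish shapes (illustrative)

# Toy FFN sublayer: output = activations @ W_out, where row i of W_out is the
# "value vector" that neuron i writes into the residual stream.
activations = rng.standard_normal(d_ff)
W_out = 0.02 * rng.standard_normal((d_ff, d_model))
unk_vector = 0.02 * rng.standard_normal(d_model)   # stand-in for E([UNK]); hypothetical

knowledge_neurons = [17, 905, 2048]            # hypothetical identified neurons

# Intervention A: plain zero-ablation of the identified neurons.
zeroed = activations.copy()
zeroed[knowledge_neurons] = 0.0
out_zero = zeroed @ W_out

# Intervention B (one reading of the knockout): drop the neurons' contributions
# *and* write a fixed [UNK]-style vector in their place.
out_unk = out_zero + unk_vector

baseline = activations @ W_out
print("shift from zero-ablation alone:", np.linalg.norm(baseline - out_zero))
print("extra shift from the injected vector:", np.linalg.norm(out_unk - out_zero))
```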
I also thought this was somewhat strange and am not sure what to make of it.
A priori, if a network did work this way, it’s unclear why individual neurons would correspond to individual lookups rather than to a distributed representation (and they probably wouldn’t, given sparsity; that’s a crazily inefficient thing to do, and if anything it seems harder for SGD to learn). So I’m not sure that this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.
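To illustrate the efficiency point with toy numbers (nothing here is from the paper): with d neurons, one-neuron-per-fact storage gives you at most d non-interfering slots, whereas random distributed directions give you many more slots that interfere only slightly with one another.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 10_000        # far more candidate "facts" than neurons

# One-neuron-per-fact storage gives at most d = 512 perfectly non-interfering
# slots.  A distributed code instead assigns each fact a random direction
# spread over all the neurons:
directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two facts is the |cosine similarity| of their
# directions; for random directions it concentrates around sqrt(2 / (pi * d)).
sample = directions[:500]
sims = np.abs(sample @ sample.T)
np.fill_diagonal(sims, 0.0)
print(f"mean interference: {sims.mean():.3f}   worst pair: {sims.max():.3f}")
```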
I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.
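By “NMF factors or something” I mean roughly the sketch below (the activation data and component count are hypothetical, just to illustrate the kind of analysis): collect a prompts-by-neurons activation matrix, factor it, and attribute the fact to a few non-negative components rather than to individual columns.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical data: non-negative FFN activations recorded on a set of
# fact-probing prompts, shape (n_prompts, n_neurons).
activations = rng.random((400, 3072))

# Instead of scoring individual neurons (columns), factor the matrix and
# score a small number of non-negative components.
nmf = NMF(n_components=16, init="nndsvd", max_iter=200)
prompt_loadings = nmf.fit_transform(activations)   # (n_prompts, 16)
components = nmf.components_                       # (16, n_neurons)

# "Which factor lights up on prompts about fact X?" replaces "which neuron
# lights up?", and each factor is a direction over many neurons at once.
fact_prompts = prompt_loadings[:50]   # pretend the first 50 prompts probe one fact
print("most active factor for these prompts:", fact_prompts.mean(axis=0).argmax())
```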
But they don’t give any evidence that the transformation had a reliable effect, or that it didn’t mess up other stuff, or that they couldn’t have gotten a similar effect by targeting other neurons.
Actually, looking at the replacement stuff in detail, it seems even weaker than that. Unless I’m missing something, it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It’s possible that they just didn’t care about exploring this effect experimentally, but I’d guess that they tried some simple stuff, found the effect to be super brittle, and so didn’t report it. And in the cases they do report, they are changing the model from remembering an incorrect fact to a correct one; that seems important, because the model probably already put significant probability on the correct answer.
Perhaps I’m too trusting—I agree that everything you’re describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.