Thanks for the link. This has been on my reading list for a little while, and your recommendation tipped me over.
Mostly I agree with Paul’s concerns about this paper.
However, I did find the “Transformer Feed-Forward Layers Are Key-Value Memories” paper they reference more interesting: it’s more mechanistic, and its results are pretty encouraging. I would personally highlight that one more, as IMO it’s stronger evidence for the hypothesis, although not conclusive by any means.
Some experiments they show:
Top-k activations of individual ‘keys’ do seem to correspond to coherent patterns in prefixes, and as we move up the layers, these patterns become less shallow and more semantics-driven. (Granted, it’s not clear how good the methodology is: to qualify as a pattern, something needs to occur in 3 of the top-25 prefixes. There are 3.6 patterns on average per key. But this is curious enough to keep looking into.)
The ‘value’ distributions corresponding to the keys are in fact somewhat predictive of the actual next word for those top-k prefixes, and exhibit a kind of ‘calibration’: while the distributions themselves aren’t actually calibrated, they are more often correct when they assign a higher probability.
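To make the ‘calibration’ point concrete, here is a rough sketch of the kind of check I have in mind: bucket values by the probability they assign to their top prediction, then see how often that prediction matches the actual next word in each bucket. The function name, the bin edges, and the toy numbers are all made up for illustration; this is not the paper’s code or data.

```python
import numpy as np

def agreement_by_confidence(max_probs, is_correct, bin_edges):
    """Fraction of correct top predictions per confidence bucket.

    max_probs:  top probability assigned by each value distribution
    is_correct: whether that top token matched the actual next word
    bin_edges:  increasing edges; bucket b covers [edges[b], edges[b+1])
    Empty buckets are reported as NaN.
    """
    max_probs = np.asarray(max_probs, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    # np.digitize returns i with edges[i-1] <= x < edges[i]; shift to 0-based.
    bin_idx = np.digitize(max_probs, bin_edges) - 1
    rates = []
    for b in range(len(bin_edges) - 1):
        mask = bin_idx == b
        rates.append(is_correct[mask].mean() if mask.any() else float("nan"))
    return np.array(rates)

# Hypothetical toy data, not taken from the paper.
max_probs = [0.01, 0.02, 0.03, 0.04, 0.20, 0.30, 0.40, 0.45]
is_correct = [0, 0, 0, 1, 0, 1, 1, 1]
rates = agreement_by_confidence(max_probs, is_correct, bin_edges=[0.0, 0.1, 0.5])
print(rates)
```

In this toy data the agreement rate in the higher bucket is well below the probabilities assigned there being trustworthy as-is, but it is higher than in the low bucket, which is the weaker ‘more correct when more confident’ property.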
I also find it very intriguing that you can just decode the value distributions using the embedding matrix, à la Logit Lens.
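For anyone who wants the mechanics spelled out, here is a minimal sketch of how I read the setup: the feed-forward layer is treated as keys K and values V, with the output being an activation-weighted sum of values, and a single value vector is turned into a distribution over the vocabulary by multiplying with the embedding matrix and applying a softmax. I am assuming a ReLU activation and dropping bias terms; the shapes, the random weights, and the helper names are illustrative and not from the paper’s code.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ffn_as_memory(x, K, V):
    """Feed-forward layer viewed as key-value memory (bias omitted).

    x: (d,) input, K: (d_m, d) keys, V: (d_m, d) values.
    Returns the layer output and the per-key memory coefficients.
    """
    coeffs = np.maximum(x @ K.T, 0.0)  # ReLU assumed as the activation
    return coeffs @ V, coeffs

def decode_value(v, E):
    """Project a value vector onto the vocabulary with E of shape (vocab, d)."""
    return softmax(E @ v)

# Random stand-ins for real model weights (illustrative only).
rng = np.random.default_rng(0)
d, d_m, vocab = 8, 16, 20
K = rng.normal(size=(d_m, d))
V = rng.normal(size=(d_m, d))
E = rng.normal(size=(vocab, d))
x = rng.normal(size=d)

out, coeffs = ffn_as_memory(x, K, V)
# Decode the value belonging to the most strongly activated key.
p = decode_value(V[int(coeffs.argmax())], E)
top_token = int(p.argmax())
```

With real weights, `top_token` is what you would compare against the actual next word for the key’s top-activating prefixes.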