I’m curious how you’d define memorisation? To me, I’d actually count this as the model learning features …
Qualitatively, when I discuss “memorization” in language models, I’m primarily referring to the phenomenon of language models producing long quotes verbatim if primed with a certain start. I mean it as a more neutral term than overfitting.
Mechanistically, the simplest version I imagine is a feature which activates when the preceding N tokens match a particular pattern, and predicts a specific N+1 token. Such a feature is analogous to the “single data point features” in this paper. In practice, I expect you can have the same feature also make predictions about the N+2, N+3, etc tokens via attention heads.
This is quite different from a normal feature in that it matches a very specific, exact pattern.
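To make the mechanistic picture concrete, here’s a toy sketch (purely illustrative, not the paper’s implementation, and all names are my own) of such a feature: a detector that fires only when the last N tokens of the context exactly match a stored pattern, and when it fires, boosts the logit of one specific continuation token.

```python
def make_memorization_feature(pattern, next_token, boost=10.0):
    """Toy 'memorization feature'.

    pattern: tuple of N token ids the feature matches exactly.
    next_token: the token id whose logit it boosts when the match fires.
    (In a real model, attention heads could reuse the same fired feature
    to also predict the N+2, N+3, ... tokens.)
    """
    n = len(pattern)

    def feature(context, logits):
        # Fires iff the last N tokens of the context match the pattern exactly.
        if tuple(context[-n:]) == pattern:
            logits[next_token] = logits.get(next_token, 0.0) + boost
        return logits

    return feature

# Example: memorize that the 3-token prefix (5, 17, 42) is always followed by 99.
feat = make_memorization_feature((5, 17, 42), next_token=99)
logits_hit = feat([1, 5, 17, 42], {})   # exact match: logit for 99 is boosted
logits_miss = feat([1, 5, 17, 41], {})  # near-miss: logits left untouched
```

The near-miss case is the point of the sketch: unlike a normal feature that responds gradually to similar inputs, this one contributes nothing unless the prefix matches exactly.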
a bunch of examples will contain the Bible verse as a substring, and so there’s a non-trivial probability that any input contains it, so this is a genuine property of the data distribution.
Agreed! This is why I’m describing it as “memorization” (which, again, I mean more neutrally than overfitting in the context of LLMs) and highlighting that language models really do seem like they morally should do this.
That said, there’s also a lot of SEO spam that language models memorize because it’s repeated, which one might think of as overfitting, even though it’s a genuine property of the training distribution.