But language models seem like they morally should memorize some data points. Language models should recite the US Constitution and Shakespeare and the Bible.
I’m curious how you’d define memorisation? To me, I’d actually count this as the model learning features—a bunch of examples will contain the Bible verse as a substring, and so there’s a non-trivial probability that any input contains it, so this is a genuine property of the data distribution. It feels analogous to the model learning bigrams or trigrams, which are basically memorising 2 or 3 token substrings—in some sense this is memorisation, but to me it’s a genuine feature of the data distribution.
My best attempt to operationalise memorisation is that it’s about ways that the training data differs from the training data distribution. If some string has infinitesimal probability of occurring in a randomly sampled data point, but occurs in a single training example, then that feels like memorisation. But if something occurs in a bunch of training examples, it probably occurs with non-trivial probability in any sample from the data distribution.
Alternatively, it’s memorisation if it results in significantly better loss on the training set than the test set (assuming they’re from the same data distribution).
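In symbols (my own notation, just to make this concrete): for a model $f$ with per-example loss $L$, training set $D_{\text{train}}$, and data distribution $\mathcal{D}$, the relevant gap is something like

$$\mathbb{E}_{x \sim \mathcal{D}}\big[L(f, x)\big] \;-\; \frac{1}{|D_{\text{train}}|}\sum_{x \in D_{\text{train}}} L(f, x),$$

and on this view I’d call it memorisation when this gap is large.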
I might be relying too hard on the notion of a data distribution though.
This also feels like an interesting difference between continuous data like images and discrete data like language—language can often have identical substrings in a way that seems much weirder in images (patches that are identical down to the pixel level?), so it feels harder to disentangle memorisation from generalisation in language.
An operational definition which I find helpful for thinking about memorization is Zhang et al.’s counterfactual memorization.
The counterfactual memorization of a document x is (roughly) the amount that the model’s loss on x degrades when you remove x from its training dataset.
More precisely, it’s the difference in expected loss on x between models trained on data distribution samples that happen to include x, and models trained on data distribution samples that happen not to include x.
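In symbols (my notation, roughly following that definition): writing $f_S$ for a model trained on dataset $S$ and $L(f, x)$ for its loss on document $x$,

$$\mathrm{mem}(x) \;\approx\; \mathbb{E}_{S \,\not\ni\, x}\big[L(f_S, x)\big] \;-\; \mathbb{E}_{S \,\ni\, x}\big[L(f_S, x)\big],$$

where $S$ is sampled from the data distribution, conditioned on excluding or including $x$, and the sign convention is that higher values mean more degradation when $x$ is removed.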
This will be lower for documents that are easy for the LM to predict using general features learned elsewhere, and higher for documents that the LM can’t predict well except by memorizing them. For example (these are intuitive guesses, not experimental results!):
A document xUUID containing a list of random UUIDs will have higher counterfactual memorization than a document xREP containing the word “the” repeated many times.
If we extend the definition slightly to cover training sets with fewer or more copies of a document x, then a document repeated many times in the training set will have higher counterfactual memorization than a document that appears only once.
Repeating xUUID many times, or doing many epochs over it, will produce more counterfactual memorization than doing the same thing with xREP. (The counterfactual memorization for xREP is upper bounded by the loss on xREP attained by a model that never even sees it once in training, and that’s already low to begin with.)
Note that the true likelihood under the data distribution only matters through its effect on the likelihood predicted by the LM. On average, likely texts will be easier than unlikely ones, but when these two things come apart, easy-vs-hard is what matters. xUUID is more plausible as natural text than xREP, but it’s harder for the LM to predict, so it has higher counterfactual memorization.
On the other hand, if we put many near duplicates of a document in the dataset—say, many copies with a random edit to a single token—then every individual near-duplicate will have low counterfactual memorization.
This is not very satisfying, since it feels like something is getting memorized here, even if it’s not localized in a single document.
To fix the problem, we might imagine broadening the concept of “whether a document is in the training set.” For example, instead of keeping or removing a literal document, we might keep/remove every document that includes a specific substring like a Bible quote.
But if we keep doing this for increasingly abstract and distant notions of “near duplication” (e.g. “remove all documents that are about frogs, even if they don’t contain the word ‘frog’”), then we’re eventually just talking about generalization!
Perhaps we could define memorization in a more general way in terms of distances along this spectrum. If we can select examples for removal using a very simple function, and removing the selected examples from the training set destroys the model’s performance on them, then it was memorizing them. But if the “document selection function” grows more complex, and starts to do generalization internally, we then say the model is generalizing as opposed to memorizing.
(ETA: though we also need some sort of restriction on the total number of documents removed. “Remove all documents containing some common word” and “remove all but the first document” are simple rules with very damaging effects, but obviously they don’t tell us anything about whether those subsets were memorized.)
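To make the simple end of that spectrum concrete, here’s a toy sketch (my own illustration, with made-up helper names; the is_about_frogs classifier is a placeholder): an exact-substring selector is about as simple as a selection function gets, whereas a topic selector is already doing generalization internally.

```python
from typing import Callable, List

Doc = str
Selector = Callable[[Doc], bool]

def substring_selector(s: str) -> Selector:
    # Very simple rule: select every document containing this exact substring.
    return lambda doc: s in doc

def topic_selector(is_about_frogs: Callable[[Doc], bool]) -> Selector:
    # Much more complex rule: the classifier itself has to generalize
    # (documents "about frogs" even when the word 'frog' never appears).
    return lambda doc: is_about_frogs(doc)

def ablated_training_set(docs: List[Doc], selector: Selector) -> List[Doc]:
    # Retrain on this, then check how much loss on the removed documents degrades.
    return [d for d in docs if not selector(d)]
```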
Hmm, this comment ended up more involved than I originally intended … mostly I wanted to drop a reference to counterfactual memorization. Hope this was of some interest anyway.
Super interesting, thanks! I hadn’t come across that work before, and that’s a cute and elegant definition.
To me, it’s natural to extend this to specific substrings within the document? I believe that models are trained with documents chopped up and concatenated into segments that fully fill the context window, so it feels odd to treat the document as the unit of analysis. And in some sense a 1000-token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised.
Maybe we should just not apply a gradient update to the tokens in the repeated substring? But keep the document in and measure loss on the rest.
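Concretely, something like this sketch (PyTorch-style; the function name and the boolean mask marking the repeated substring are just my own illustration of the idea):

```python
import torch
import torch.nn.functional as F

def loss_excluding_substring(logits, targets, in_repeated_substring):
    # logits: (seq_len, vocab_size); targets: (seq_len,) token ids
    # in_repeated_substring: (seq_len,) bool mask over target positions
    per_token = F.cross_entropy(logits, targets, reduction="none")
    keep = ~in_repeated_substring
    # Tokens inside the repeated substring contribute neither loss nor gradient;
    # the rest of the document is trained on (and measured) as usual.
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```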
I’m curious how you’d define memorisation? To me, I’d actually count this as the model learning features …
Qualitatively, when I discuss “memorization” in language models, I’m primarily referring to the phenomenon of language models producing long quotes verbatim if primed with a certain start. I mean it as a more neutral term than overfitting.
Mechanistically, the simplest version I imagine is a feature which activates when the preceding N tokens match a particular pattern, and predicts a specific N+1 token. Such a feature is analogous to the “single data point features” in this paper. In practice, I expect you can have the same feature also make predictions about the N+2, N+3, etc. tokens via attention heads.
This is quite different from a normal feature in that it matches a very specific, exact pattern.
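As a cartoon of that mechanism (a toy sketch of my own, not the actual circuit; ExactPrefixFeature and logit_boost are made-up names): an exact-match lookup from the preceding N tokens to a logit boost on one specific next token.

```python
from typing import Dict, List, Tuple

class ExactPrefixFeature:
    """Toy stand-in for a memorization feature: fires only on one exact N-token pattern."""

    def __init__(self, n: int):
        self.n = n
        self.table: Dict[Tuple[int, ...], int] = {}  # exact N-token prefix -> next token id

    def memorize(self, prefix: List[int], next_token: int) -> None:
        assert len(prefix) == self.n
        self.table[tuple(prefix)] = next_token

    def logit_boost(self, context: List[int], vocab_size: int, strength: float = 10.0) -> List[float]:
        boost = [0.0] * vocab_size
        key = tuple(context[-self.n:])
        if key in self.table:  # activates only on an exact match of the preceding N tokens
            boost[self.table[key]] += strength
        return boost
```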
a bunch of examples will contain the Bible verse as a substring, and so there’s a non-trivial probability that any input contains it, so this is a genuine property of the data distribution.
Agreed! This is why I’m describing it as “memorization” (which, again, I mean more neutrally than overfitting in the context of LLMs) and highlighting that it really does seem like language models morally should do this.
Although there’s also lots of SEO spam that language models memorize because it’s repeated, which one might think of as overfitting, even though it’s a property of the training distribution.
Interesting context, thanks for writing it up!