Yes, exact or near-exact copies of the data existing in the database. One can also easily imagine examples where, for example, Wikitext103 has exact copies removed from the dataset, but exact translations remain, or where quotes from a Wikipedia article are interspersed throughout the internet, or some bot-generated website exposes some mangled data in a form the model figured out how to deconstruct.
In general, models will exploit leakage when available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will have pretty much all the leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.
Yes, exact or near-exact copies of the data existing in the database. One can also easily imagine examples where, for example, Wikitext103 has exact copies removed from the dataset, but exact translations remain, or where quotes from a Wikipedia article are interspersed throughout the internet, or some bot-generated website exposes some mangled data in a form the model figured out how to deconstruct.
In general, models will exploit leakage when available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will have pretty much all the leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.