Could you explain a little more about what you mean by data leakage? Do you mean that complete copies of the text sampled for the evaluation set exist in the training set? Is this one of those things where curating a good dataset is a surprising amount of the work of ML, and so a lot of people haven’t done it?
Edit: Oh. I have now looked at the Retro paper. I’d still be interested in hearing your take on what makes different datasets leaky.
Yes, exact or near-exact copies of the evaluation data existing in the database. One can also easily imagine subtler cases: Wikitext103 might have exact copies scrubbed from the retrieval database while exact translations remain, or quotes from a Wikipedia article might be scattered across the internet, or some bot-generated website might expose mangled versions of the data in a form the model has learned to decode.
In general, models will exploit leakage when available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will have pretty much all the leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.
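For concreteness, the usual way to even measure this kind of leakage is some n-gram overlap statistic between each evaluation chunk and the closest chunk in the training data or retrieval database (the Retro paper reports something along these lines). Here is a rough sketch of that sort of check; the choice of n, the 0.8 threshold, and the brute-force scan over database chunks are all placeholder choices of mine, not anything taken from the paper:

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token sequence."""
    toks = list(tokens)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_ratio(eval_tokens: List[str], db_tokens: List[str], n: int = 13) -> float:
    """Fraction of the evaluation chunk's n-grams that also appear in the
    database chunk. 1.0 means every n-gram is duplicated."""
    eval_grams = ngrams(eval_tokens, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(db_tokens, n)) / len(eval_grams)


def is_leaked(eval_tokens: List[str],
              database_chunks: Iterable[List[str]],
              n: int = 13,
              threshold: float = 0.8) -> bool:
    """Flag an evaluation chunk as 'leaked' if any database chunk shares
    at least `threshold` of its n-grams. Both parameters are arbitrary here."""
    return any(overlap_ratio(eval_tokens, chunk, n) >= threshold
               for chunk in database_chunks)
```

The point of a chunk-level statistic like this, rather than exact string matching over whole documents, is that it still catches the partially-copied and lightly-mangled cases above; translations and heavier paraphrases would slip through and need something stronger than n-gram overlap.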