Skimming the Rᴇᴛʀᴏ paper is weird: it looks like there’s leakage everywhere, they admit leakage is everywhere, but then they report results as if it doesn’t matter, even putting a result on their leakiest dataset in their conclusion?
"On Wikitext103 and the Pile, Retro outperforms previous models trained on large scale datasets."
It looks to me like Figure 6 is saying the improvement is fairly modest on the non-leaky datasets?
Maybe someone who has gone over the paper in detail can chime in with thoughts.
Could you explain a little more about what you mean by data leakage? Do you mean that complete copies of the text sampled for the evaluation set exist in the training set? Is this one of those things where curating a good dataset is a surprising amount of the work of ML, and so a lot of people haven’t done it?
Edit: Oh. I have now looked at the Retro paper. I’d still be interested in hearing your take on what makes different datasets leaky.
Yes, exact or near-exact copies of the data existing in the retrieval database. One can also easily imagine cases where, for example, Wikitext103 has exact copies removed from the training set but exact translations remain, or where quotes from a Wikipedia article are scattered across the internet, or some bot-generated website exposes mangled data in a form the model has figured out how to decode.
In general, models will exploit leakage when it is available. Even non-retrieval models seem to memorize snippets of text fairly effectively, even though that seems like a somewhat difficult task for them architecturally. Datasets which amount to “basically the internet” will exhibit pretty much every kind of leakage, and the paper all but proves their deduplication was not adequate. I do expect that it is difficult to curate a good dataset for evaluating a model like this.
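To make the kind of check involved concrete, here is a minimal sketch of one common leakage heuristic: the fraction of an evaluation chunk’s token n-grams that already appear somewhere in the training/retrieval corpus. This is not the exact filtering procedure from the Retro paper (which scores evaluation chunks against their retrieved neighbours); the function names, the n-gram length, and any threshold you pick are illustrative assumptions.

    # Sketch of a leakage heuristic: share of an eval chunk's token n-grams
    # that also occur in the training corpus. Not the Retro paper's method;
    # names and parameters here are illustrative assumptions.

    def ngrams(tokens, n=8):
        # All contiguous n-grams of a token list, as a set of tuples.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_train_index(train_docs, n=8):
        # Collect every n-gram seen anywhere in the training corpus.
        index = set()
        for doc in train_docs:
            index |= ngrams(doc.split(), n)
        return index

    def leakage_fraction(eval_chunk, train_index, n=8):
        # Fraction of the eval chunk's n-grams already present in training data.
        grams = ngrams(eval_chunk.split(), n)
        if not grams:
            return 0.0
        return sum(g in train_index for g in grams) / len(grams)

    # Usage: flag chunks with high overlap as "leaky" and report them separately.
    train_index = build_train_index(["the cat sat on the mat and looked around the room quietly"])
    print(leakage_fraction("the cat sat on the mat and looked around the room quietly today", train_index))

With something like this you could split an evaluation set by overlap fraction and report perplexity separately for the leaky and non-leaky portions, which is roughly the spirit of what the paper’s leakage analysis is getting at.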