The GPT-3 paper says,

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.
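The fuzzy-deduplication step they mention is commonly implemented with MinHash plus locality-sensitive hashing. The sketch below (using the datasketch library) is only a generic illustration of that technique, not the paper's actual implementation, which Appendix A describes only at a high level.

```python
# Generic sketch of document-level fuzzy deduplication with MinHash + LSH.
# This is NOT the GPT-3 authors' code; it just illustrates the idea of
# keeping one representative per cluster of near-duplicate documents.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

documents = {
    "doc-1": "the quick brown fox jumps over the lazy dog",
    "doc-2": "the quick brown fox jumps over the very lazy dog",
    "doc-3": "a completely unrelated piece of text",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}
for doc_id, text in documents.items():
    m = minhash(text)
    if not lsh.query(m):      # no near-duplicate already kept
        lsh.insert(doc_id, m)
        kept[doc_id] = text

print(sorted(kept))  # doc-2 is likely dropped as a near duplicate of doc-1
```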
Appendix A of the paper describes their document-filtering process in greater detail. It is difficult to know exactly what ended up in the final dataset after filtering, as I am not aware of any publicly downloadable version (though perhaps one exists; I haven’t looked hard).
That said, we can estimate how likely a given document was to appear in GPT-3’s training data by searching for it in the Common Crawl URL Index. GPT-3 was presumably trained in early 2020. If you assume March 2020, visit this link and enter a query. To search a whole domain, such as lesswrong.com, add a wildcard, e.g. "https://www.lesswrong.com/*" (without the quotes).
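You can also query the index programmatically via the CDX API at index.commoncrawl.org. The small Python sketch below assumes the CC-MAIN-2020-16 collection as an early-2020 crawl; substitute whichever crawl you think best matches GPT-3's cutoff.

```python
# Query the Common Crawl CDX index for captures of a URL or whole domain.
# "CC-MAIN-2020-16" is an assumption for an early-2020 crawl; pick the
# collection you believe matches GPT-3's training cutoff.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2020-16-index"

def captures(url_pattern):
    resp = requests.get(
        INDEX,
        params={"url": url_pattern, "output": "json"},
        timeout=60,
    )
    # The server returns HTTP 404 when there are no captures at all,
    # so this will raise in that case.
    resp.raise_for_status()
    # One JSON record per line.
    return [json.loads(line) for line in resp.text.splitlines() if line]

# Whole-domain query with a wildcard, as described above.
for record in captures("https://www.lesswrong.com/*")[:10]:
    print(record.get("timestamp"), record.get("url"), record.get("status"))
```

For large domain queries the API also exposes paging parameters, which this sketch ignores; it only looks at the first batch of results.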