Douglas_Knight comments on Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)’s training corpus?

Douglas_Knight 29 Mar 2023 18:47 UTC
41 points
6
I assume you know this, but to be clear, OpenAI has already used pirated books. GPT-3 was trained on “books2”, which appears to be all the text on libgen (and pretty much all the books on libgen have been through OCR). It was weighted the same as Common Crawl, lower than Gutenberg or Reddit links. This seems to answer your second question: they will likely treat PDFs on libgen the same as PDFs on the open web. If you’re asking whether they will train the model on the pixels in these PDFs, which might make up for losses in OCR, I have no idea.
- AnnaSalamon 29 Mar 2023 19:15 UTC
  5 points
  0
  I did not know this; thanks!