I assume you know this, but to be clear, OpenAI has already used pirated books. GPT-3 was trained on “books2”, which appears to be all the text on libgen (and pretty much all the books on libgen have been through OCR). It was weighted the same as Common Crawl, lower than Gutenberg or Reddit links. This seems to answer your second question: they will likely treat PDFs on libgen the same as PDFs on the open web. If you’re asking whether they will train the model on the pixels in these PDFs, which might make up for losses in OCR, I have no idea.
I did not know this; thanks!