Depends on how big of a model you’re trying to train, and how you’re trying to train it.
I was imagining something along the lines of: download the full 100TB torrent, which includes 88M articles, and extract the text of each article (extracting text from an arbitrary PDF isn’t super reliable, but it should be largely doable). That should leave you somewhere in the ballpark of 4TB of uncompressed plain text. If you’re using a BPE tokenizer, that works out to ~1T tokens.
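For the curious, here is the back-of-envelope arithmetic behind those numbers. The 4 bytes per token figure is my assumption (a common rule of thumb for BPE on English text, and it varies by tokenizer and domain), so treat the result as an order-of-magnitude estimate.

```python
# Rough corpus sizing. The bytes-per-token ratio is an assumed rule of thumb,
# not a measured value for this corpus.
CORPUS_BYTES = 4e12      # ~4TB of extracted plain text (estimate from above)
ARTICLE_COUNT = 88e6     # ~88M articles in the torrent
BYTES_PER_TOKEN = 4.0    # assumption: typical for BPE on English prose


def estimate_tokens(total_bytes: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Estimate the token count from raw text size."""
    return total_bytes / bytes_per_token


total_tokens = estimate_tokens(CORPUS_BYTES)
bytes_per_article = CORPUS_BYTES / ARTICLE_COUNT

print(f"~{total_tokens:.1e} tokens total")
print(f"~{bytes_per_article / 1000:.0f} KB of text per article")
```

That comes out to about 45KB of text per article, which seems plausible for a paper.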
If you’re trying to do the Chinchilla-optimality thing, I fully agree that there’s no way you’re going to be able to do that with the compute budget available to mere mortals. But if you’re trying to do the “generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if it was literally copy-pasted” thing, I think that’d be entirely doable on a hobbyist budget.
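To put rough numbers on the Chinchilla point: the usual rule of thumb is about 20 training tokens per parameter, and training compute is commonly approximated as 6 × parameters × tokens FLOPs. Both are approximations, and the per-GPU throughput below is a round number I made up for illustration, so this is a sketch rather than a real budget.

```python
# Back-of-envelope Chinchilla-style compute estimate.
TOKENS_PER_PARAM = 20        # rule-of-thumb ratio (approximate)
FLOPS_PER_PARAM_TOKEN = 6    # common approximation: C ~ 6 * N * D
ASSUMED_GPU_FLOPS = 1e14     # hypothetical sustained FLOP/s per GPU (assumption)
SECONDS_PER_DAY = 86_400


def chinchilla_estimate(tokens: float) -> dict:
    """Return model size, training FLOPs, and GPU-days for a given token count."""
    params = tokens / TOKENS_PER_PARAM
    flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    gpu_days = flops / ASSUMED_GPU_FLOPS / SECONDS_PER_DAY
    return {"params": params, "flops": flops, "gpu_days": gpu_days}


estimate = chinchilla_estimate(1e12)
print(estimate)
```

For ~1T tokens that is a model of roughly 50B parameters and about 3e23 FLOPs, which at the assumed throughput is tens of thousands of GPU-days. Not a hobby project.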
I personally think it’d be a great learning project.
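If anyone wants to try it, the embed-then-verify idea might look roughly like this. Everything here is a stand-in: `toy_embed` is a hypothetical hashed character-trigram vector, not a real embedding model, the thresholds are arbitrary, and the all-pairs loop would have to become an approximate nearest-neighbour index at anything like 88M papers.

```python
import hashlib
import math
from itertools import combinations

DIM = 256  # size of the toy embedding vector (arbitrary)


def toy_embed(text: str, n: int = 3) -> list[float]:
    """Hypothetical stand-in for a real embedding model:
    hashed character n-gram counts, L2-normalized."""
    vec = [0.0] * DIM
    normalized = " ".join(text.lower().split())
    for i in range(max(len(normalized) - n + 1, 1)):
        gram = normalized[i:i + n]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    length = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / length for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def find_copies(paragraphs, sim_threshold=0.8, max_edit_ratio=0.1):
    """Cheap similarity filter first, then edit distance on the survivors."""
    vecs = [toy_embed(p) for p in paragraphs]
    hits = []
    for i, j in combinations(range(len(paragraphs)), 2):
        if cosine(vecs[i], vecs[j]) < sim_threshold:
            continue  # not similar enough to bother with edit distance
        dist = edit_distance(paragraphs[i], paragraphs[j])
        longest = max(len(paragraphs[i]), len(paragraphs[j]))
        if dist <= max_edit_ratio * longest:
            hits.append((i, j, dist))
    return hits
```

Calling `find_copies(list_of_paragraphs)` returns `(index_a, index_b, edit_distance)` tuples for pairs that look copy-pasted. The point of the two stages is that edit distance is quadratic in paragraph length, so you only want to run it on the few pairs the similarity search flags.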