Another interesting corpus (though legally problematic) would be Sci-Hub. Quick googling gives estimates of around 50 million research articles; the average research article runs around 4000 words, and Sci-Hub is estimated to contain about 69% of all research articles published in peer-reviewed journals. Counting roughly one token per word, that would put Sci-Hub at about 50 million * 4000 = 200B tokens and the whole scientific journal literature at an estimated 200B / 0.69 ≈ 290B tokens.
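The arithmetic behind this estimate can be sketched in a few lines of Python. This is only a rough sketch using the figures quoted above; the constant names, the `estimate_tokens` helper, and the `TOKENS_PER_WORD` ratio of 1.0 are illustrative assumptions, not measured values. A real subword tokenizer usually produces more than one token per word, so the true token counts would likely be somewhat higher.

```python
# Back-of-the-envelope corpus size estimate from the figures quoted in the text.
ARTICLES_IN_SCIHUB = 50_000_000   # ~50 million articles (rough estimate)
WORDS_PER_ARTICLE = 4_000         # average article length in words (rough estimate)
SCIHUB_COVERAGE = 0.69            # share of peer-reviewed articles in Sci-Hub
TOKENS_PER_WORD = 1.0             # assumption: one token per word


def estimate_tokens(articles: int, words_per_article: int,
                    tokens_per_word: float = 1.0) -> float:
    """Return an approximate token count for a corpus of articles."""
    return articles * words_per_article * tokens_per_word


scihub_tokens = estimate_tokens(ARTICLES_IN_SCIHUB, WORDS_PER_ARTICLE, TOKENS_PER_WORD)

# Scale up from Sci-Hub's share to the whole journal literature.
all_journal_tokens = scihub_tokens / SCIHUB_COVERAGE

print(f"Sci-Hub: ~{scihub_tokens / 1e9:.0f}B tokens")
print(f"All peer-reviewed journals: ~{all_journal_tokens / 1e9:.0f}B tokens")
```

Changing `TOKENS_PER_WORD` shows how sensitive the totals are to the tokenizer assumption; both estimates scale linearly with it.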