merely being accessible online doesn’t get them in the training set of capabilities researchers’ AIs. Collecting books to contribute to LLM datasets seems like a good idea, but it’s ideologically loaded.
I think scraping reddit is common. The SSC subreddit is pretty popular. I wonder if there could be a post on that subreddit that was just a space for people to publish books in the comments.
I feel like we have very different models of how people get their datasets. I’m pretty sure you’d have to just hand someone a dataset and say “here, I downloaded some books for your AGI kid to read.”
My model is that OpenAI and Anthropic researchers set up a web-scraper that reads through lots of popular internal reddit links (or possibly literally all of reddit) and then uses all of that as the training data for their language models.
...googling shows this as the official answer for GPT-3, which contains a lot of the popular and public internet. I am unclear whether that contains reddit, but if not then I believe I heard that they made a crawler specifically for reddit.
But are they going to do that again? GPT-4 used the same training set as GPT-3, didn’t it?
Ah, I was under a misapprehension: I thought the data was much more recent, but the GPT-4 page says:
However, that is after GPT-3 was released (June 2020), so it’s a new dataset.
Extrapolating naively, 2 years from now we will see GPT-5 trained on data from today.
I was figuring GPT-4 was already trained on a sizable fraction of the internet, and GPT-5 would be trained on basically all the text (plus maybe some not-text, not sure). Is this wrong?
Oh hmm, that could be true. I suspect that data curation is too important though; there are significant gains to be had by not including confusing data as positive examples.
But things like pre-training with preferences should take care of that concern, no? Just mark good stuff with a magic good-stuff token, but allow the transformer to refine features for everything.
Yeah, could be. I’m going to abstain from any further claims; I only have so much hunch fluid here.
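For anyone who wants the "magic good-stuff token" idea from a couple of comments up made concrete, here is a minimal sketch of tagging documents instead of filtering them out. Everything named here is an assumption for illustration: the `<|good|>` and `<|bad|>` token strings, the `tag_documents` helper, and the toy scoring function are all hypothetical, and a real setup would use a learned quality or preference classifier rather than a character count.

```python
# Hypothetical control tokens; a real tokenizer would need these added as special tokens.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_documents(docs, score_fn, threshold=0.5):
    """Prefix every document with a control token rather than dropping low-scoring ones."""
    tagged = []
    for text in docs:
        token = GOOD if score_fn(text) >= threshold else BAD
        tagged.append(f"{token} {text}")
    return tagged

def toy_score(text):
    # Stand-in for a real quality classifier (assumption):
    # the fraction of characters that are letters or whitespace.
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)

corpus = [
    "A clear explanation of gradient descent.",
    "$$$ 1234 #### 5678 !!!",
]
tagged = tag_documents(corpus, toy_score, threshold=0.8)
for line in tagged:
    print(line)
```

The point of the sketch is that nothing is thrown away: the model still sees the low-scoring text and can learn features from it, and at sampling time you would condition on the good token.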