My model is that OpenAI and Anthropic researchers set up a web-scraper that reads through lots of popular links posted on reddit (or possibly literally all of reddit) and then uses all of that as training data for their language models.
...googling shows this as the official answer for GPT-3: a dataset covering much of the popular, public internet. I am unclear whether that includes reddit, but if not, then I believe I heard that they made a crawler specifically for reddit.
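As a rough illustration of that model (not OpenAI's or Anthropic's actual pipeline), here is a minimal sketch of how a crawler might collect popular outbound links from reddit as candidate training documents. It assumes the public reddit JSON listing endpoint and the third-party `requests` library; the listing choice and score threshold are hypothetical.

```python
import requests

# Hypothetical parameters: which listing to read and how popular a post must be.
LISTING_URL = "https://www.reddit.com/r/all/top.json"
MIN_SCORE = 100

def popular_outbound_links(limit=100):
    """Yield outbound URLs from popular reddit posts (a toy sketch of this kind of collection)."""
    resp = requests.get(
        LISTING_URL,
        params={"limit": limit, "t": "week"},
        headers={"User-Agent": "training-data-sketch/0.1"},  # reddit rejects requests without a UA
    )
    resp.raise_for_status()
    for child in resp.json()["data"]["children"]:
        post = child["data"]
        # Keep only sufficiently popular posts that link out to external pages.
        if post["score"] >= MIN_SCORE and not post["is_self"]:
            yield post["url"]

if __name__ == "__main__":
    for url in popular_outbound_links(limit=25):
        print(url)  # each URL would then be fetched and its text added to the corpus
```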
But are they going to do that again? GPT-4 used the same training set as GPT-3, didn't it?
Ah, I was under a misapprehension; I thought the data was much more recent, but the GPT-4 page says the vast majority of its training data cuts off in September 2021. However, that is after GPT-3 was released (June 2020), so it is a new dataset.
Extrapolating naively, 2 years from now we will see GPT-5 trained on data from today.