My model is that OpenAI and Anthropic researchers set up a web-scraper that reads through lots of popular links posted on reddit (or possibly literally all of reddit) and then uses all of that as training data for their language models.
...googling shows this as the official answer for GPT-3: a dataset covering much of the popular, public internet. I am unclear whether that includes reddit, but if not, then I believe I heard that they made a crawler specifically for reddit.
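As a rough illustration of that model (not OpenAI's or Anthropic's actual pipeline), here is a minimal sketch of how a crawler might collect popular outbound links from reddit as candidate training documents. It assumes the public reddit JSON listing endpoint and the third-party `requests` library; the listing choice and score threshold are hypothetical.

```python
import requests

# Hypothetical parameters: which listing to read and how popular a post must be.
LISTING_URL = "https://www.reddit.com/r/all/top.json"
MIN_SCORE = 100

def popular_outbound_links(limit=100):
    """Yield outbound URLs from popular reddit posts (a toy sketch of this kind of collection)."""
    resp = requests.get(
        LISTING_URL,
        params={"limit": limit, "t": "week"},
        headers={"User-Agent": "training-data-sketch/0.1"},  # reddit rejects requests without a UA
    )
    resp.raise_for_status()
    for child in resp.json()["data"]["children"]:
        post = child["data"]
        # Keep only sufficiently popular posts that link out to external pages.
        if post["score"] >= MIN_SCORE and not post["is_self"]:
            yield post["url"]

if __name__ == "__main__":
    for url in popular_outbound_links(limit=25):
        print(url)  # each URL would then be fetched and its text added to the corpus
```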
But are they going to do that again? GPT-4 used the same training set as GPT-3, didn't it?
Ah, I was under a misapprehension; I thought the data was much more recent, but the GPT-4 page says the vast majority of its training data cuts off in September 2021. However, that is after GPT-3 was released (June 2020), so it is a new dataset.
Extrapolating naively, 2 years from now we will see GPT-5 trained on data from today.