the gears to ascension comments on Are there specific books that it might slightly help alignment to have on the internet?

the gears to ascension 29 Mar 2023 5:19 UTC
4 points
0
Merely being accessible online doesn’t get them into the training sets of capabilities researchers’ AIs. Collecting books to contribute to LLM datasets seems like a good idea, but it’s ideologically loaded.
- Ben Pace 29 Mar 2023 5:43 UTC
  4 points
  −1
  Parent
  I think scraping reddit is common. The SSC subreddit is pretty popular. I wonder if there could be a post on that subreddit that was just a space for people to publish books in the comments.
  - the gears to ascension 29 Mar 2023 5:58 UTC
    2 points
    0
    Parent
    I feel like we have very different models of how people get their datasets. I’m pretty sure you’d have to just hand someone a dataset and say, “here, I downloaded some books for your agi kid to read.”
    - Ben Pace 29 Mar 2023 6:07 UTC
      6 points
      0
      Parent
      My model is that OpenAI and Anthropic researchers set up a web-scraper that reads through lots of popular internal reddit links (or possibly literally all of reddit) and then uses all of that as the training data for their language models.
      ...googling shows this as the official answer for GPT-3, which contains a lot of the popular and public internet. I am unclear whether that contains reddit, but if not then I believe I heard that they made a crawler specifically for reddit.
      - the gears to ascension 29 Mar 2023 6:15 UTC
        4 points
        0
        Parent
        But are they going to do that again? GPT4 used the same training set as GPT3, didn’t it?
        Ben Pace 29 Mar 2023 6:20 UTC
        6 points
        0
        Parent
        Ah, I was under a misapprehension. I thought the data was much more recent, but the GPT-4 page says:
        “GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021)”
        However, that is after GPT-3 was released (June 2020), so it’s a new dataset.
        Extrapolating naively, 2 years from now we will see GPT-5 trained on data from today.
    - AnnaSalamon 29 Mar 2023 6:03 UTC
      4 points
      0
      Parent
      I was figuring GPT4 was already trained on a sizable fraction of the internet, and GPT5 would be trained on basically all the text (plus maybe some not-text, not sure). Is this wrong?
      - the gears to ascension 29 Mar 2023 6:05 UTC
        4 points
        0
        Parent
        Oh hmm, that could be true. I suspect that data curation is too important though; there are significant gains to be had by not including confusing data as positive examples.
        Vladimir_Nesov 29 Mar 2023 6:21 UTC
        4 points
        0
        Parent
        
        significant gains to be had by not including confusing data
        
        But things like pre-training with preferences should take care of that concern, no? Just mark good stuff with a magic good-stuff token, but allow the transformer to refine features for everything.
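
To make the contrast in this exchange concrete, here is a minimal sketch of the two options: curating by dropping low-quality documents, versus keeping everything and prefixing each document with a control token. The token strings, the `quality` score, and the 0.5 threshold are all assumptions made for illustration; nothing in the thread specifies how documents would be scored or which tokens would be used.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical control tokens; the thread only says "a magic good-stuff token".
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


@dataclass
class Document:
    text: str
    quality: float  # assumed score in [0, 1] from some classifier or heuristic


def filter_corpus(docs: List[Document], threshold: float = 0.5) -> List[str]:
    """Curation by exclusion: low-scoring documents never reach training."""
    return [d.text for d in docs if d.quality >= threshold]


def tag_corpus(docs: List[Document], threshold: float = 0.5) -> List[str]:
    """Curation by tagging: every document is kept, but each one is
    prefixed with a token saying which side of the threshold it fell on."""
    tagged = []
    for d in docs:
        token = GOOD_TOKEN if d.quality >= threshold else BAD_TOKEN
        tagged.append(f"{token}{d.text}")
    return tagged


def generation_prompt(user_prompt: str) -> str:
    """At sampling time, condition on the good token to ask for good-style text."""
    return f"{GOOD_TOKEN}{user_prompt}"


docs = [
    Document("A clear explanation of corrigibility.", 0.9),
    Document("Confusing, self-contradictory rant.", 0.1),
]
filtered = filter_corpus(docs)
tagged = tag_corpus(docs)
```

With filtering, the model never sees the second document at all. With tagging, it still learns features from it, but the text is associated with the bad token rather than treated as an unmarked positive example.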
        the gears to ascension 29 Mar 2023 6:44 UTC
        2 points
        0
        Parent
        Yeah, could be. I’m going to abstain from any further claims; I only have so much hunch fluid here.