gwern comments on Forecasting ML Benchmarks in 2023

gwern 20 Jul 2022 1:18 UTC
19 points
4
- Not Relevant 20 Jul 2022 16:45 UTC
  3 points
  0
  Parent
  My attempt at putting numbers on the total data out there, for those curious:
  * 64,000 Weibo posts per minute x ~500k minutes per year x 10 years = ~3T tokens. I’d guess there are at least 10 social media sites this size, but this is super-sensitive data sharded across competing actors, so unless it’s a CCP-led consortium I think upperbounding this at 10T tokens seems reasonable.
  * Let’s say that all of social media is about a tenth of the text contributed to the internet. Then Google’s scrape is ~300T, assuming the internet of the last decade is substantially larger than that of the preceding decades.
  * 500 hours of video uploaded to YouTube every minute x ~500k minutes per year x 10 years x 3600 seconds per hour x (my guess of) 10 tokens per second = ~90T tokens on YouTube.
  * O(100 million) books published ever x 100,000 tokens per book = 10T tokens, roughly.
  Let’s assume Chinchilla scaling laws still provide the correct total quantity of data needed. Chinchilla scaling laws suggest ~200T tokens for a 10T-param model, so this does indeed seem like it’s in-range for Google (due either to their scrape or to YouTube) and maybe a CCP-led consortium.
  (There is also obviously the possibility of collecting new data, or of building lots of simulators.)
  (Unclear whether anyone else similarly scraped the internet, or whether enough of it is still intact for scrapers to go in afterward.)
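The estimates in the comment above can be re-derived with a few multiplications. This is a minimal sketch, not anything the commenter ran. The ~10 tokens per Weibo post is an assumption: the comment does not state it, and it is simply the value implied by the ~3T total. The ratio of 20 tokens per parameter is likewise the ratio implied by "~200T tokens for a 10T-param model". All variable and function names are hypothetical.

```python
# Order-of-magnitude token estimates; every input is a rough guess.
MINUTES_PER_YEAR = 500_000  # ~525,600, rounded down as in the comment
YEARS = 10

# ASSUMPTION: ~10 tokens per post (not stated; implied by the ~3T total).
TOKENS_PER_POST = 10
weibo_tokens = 64_000 * MINUTES_PER_YEAR * YEARS * TOKENS_PER_POST

# 500 hours uploaded per minute, 3600 s per hour, a guessed 10 tokens per second.
youtube_tokens = 500 * MINUTES_PER_YEAR * YEARS * 3600 * 10

# ~100 million books at ~100,000 tokens each.
book_tokens = 100_000_000 * 100_000

# ASSUMPTION: ~20 training tokens per parameter, the ratio implied by
# "~200T tokens for a 10T-param model".
TOKENS_PER_PARAM = 20


def chinchilla_tokens(params: int) -> int:
    """Approximate compute-optimal token count for a given parameter count."""
    return TOKENS_PER_PARAM * params


tokens_needed = chinchilla_tokens(10 * 10**12)  # 10T-parameter model

for name, tokens in [
    ("Weibo (one site)", weibo_tokens),
    ("YouTube", youtube_tokens),
    ("Books", book_tokens),
    ("Needed for 10T params", tokens_needed),
]:
    print(f"{name:>22}: {tokens / 1e12:,.1f}T tokens")
```

Changing any single input by a factor of a few moves the totals by the same factor, so only the orders of magnitude are meaningful.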
  - gwern 20 Jul 2022 17:03 UTC
    6 points
    2
    Parent
    - Not Relevant 20 Jul 2022 17:32 UTC
      3 points
      0
      Parent
      180m books now
      That’s still just 20T tokens.
      academic papers/theses are a few mill a year too
      10M papers per year x 10,000 tokens per paper x 30 years = 3T tokens.
      You raise the possibility that data quality might be important and that maybe “papers/theses” are higher quality than Chinchilla scaling laws identified on The Pile; I don’t really have a good intuition here.
      I spent a little while trying to find upload numbers for the other video platforms, to no avail. Per Wikipedia, Twitch is the 3rd largest worldwide video platform (though this doesn’t count apps, esp. TikTok/Instagram). Twitch has an average of 100,000 streams going on at any given time x 3e8 tokens per video-year (x maybe 5 years) = 100T tokens, similar to YouTube. So this does convince me that there are probably a few more entities with this much video data.
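The same kind of arithmetic covers the books, papers, and Twitch figures in this comment. This is a rough sketch with hypothetical names. Deriving "3e8 tokens per video-year" from 10 tokens per second is an assumption, carried over from the YouTube guess earlier in the thread. With these inputs the Twitch product comes to roughly 150T, which rounds to the same order of magnitude as the ~100T in the comment.

```python
# Rough re-derivation of the books, papers and Twitch estimates.
TOKENS_PER_SECOND = 10  # ASSUMPTION: reused from the earlier YouTube guess
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

# 180 million books at ~100,000 tokens each.
book_tokens = 180_000_000 * 100_000

# 10 million papers per year, ~10,000 tokens each, over 30 years.
paper_tokens = 10_000_000 * 10_000 * 30

# One continuously running stream for a year: ~3e8 tokens.
tokens_per_video_year = TOKENS_PER_SECOND * SECONDS_PER_YEAR

# ~100,000 concurrent streams, for "maybe 5 years".
twitch_tokens = 100_000 * tokens_per_video_year * 5

for name, tokens in [
    ("Books", book_tokens),
    ("Papers", paper_tokens),
    ("Twitch", twitch_tokens),
]:
    print(f"{name:>6}: {tokens / 1e12:,.1f}T tokens")
```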
      - gwern 20 Jul 2022 18:20 UTC
        4 points
        2
        Parent
        Not Relevant 20 Jul 2022 18:40 UTC
        2 points
        1
        Parent
        I agree that if you put enough of these together, there are probably ~10 actors that can scrape together >200T tokens. This is important for coordination; it means the number of organizations that can play at this level will be heavily bottlenecked, potentially for years (until a bunch more data can be generated, which won’t be free). It seems to me that these ~10 actors are large highly-legible entities that are already well-known to the US or Chinese governments. This could be a meaningful lever for mitigating the race-dynamic fear that “even if we don’t do it, someone else will”, reducing everything to a two-party US-China negotiation.
        Noosphere89 20 Jul 2022 20:43 UTC
        1 point
        0
        Parent
        The big problem is the cold war mentality is back, and both sides will compete a lot more rather than cooperate. Combine this with a bit of an arms race by China and the US, and the chances for cooperation on existential risk are remote.
        Not Relevant 20 Jul 2022 20:51 UTC
        1 point
        0
        Parent
        This is a separate discussion, but it is important to point out that the literal Cold War had the opposing powers cooperate on existential risk reduction. Granted, before that, two cities were burned to ash and we played apocalypse chicken in Cuba.
        Not Relevant 20 Jul 2022 20:27 UTC
        1 point
        0
        Parent
        Two more points:
        
        * The specific upper bound does matter if we’re worried about superintelligence. If easy-to-get data instead capped out at 10 quadrillion tokens, it’d be easy to blow past 10T-param models; if we conveniently threshold around human-level params, we might be more likely to be dealing with “fast parallel Von Neumanns” than a basilisk, at least initially.
        * Just to register a prediction: I would be very surprised if photos have anywhere near as much information content as text/video, given their relative lack of long-term causal structure.
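To make the first point concrete, the Chinchilla relation can be inverted to ask how large a model a given data budget would support. This is a sketch under the same assumed ratio of ~20 tokens per parameter, the ratio implied by "~200T tokens for a 10T-param model" earlier in the thread. The function name is hypothetical.

```python
# ASSUMPTION: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_params(tokens: int) -> int:
    """Parameter count that a given token budget would support at this ratio."""
    return tokens // TOKENS_PER_PARAM


params_at_200T = compute_optimal_params(200 * 10**12)  # the ~200T-token budget
params_at_10Q = compute_optimal_params(10 * 10**15)    # hypothetical 10 quadrillion tokens

print(f"200T tokens -> {params_at_200T / 1e12:,.0f}T params")
print(f"10Q tokens  -> {params_at_10Q / 1e12:,.0f}T params "
      f"({params_at_10Q // params_at_200T}x larger)")
```

Under that ratio, a 10-quadrillion-token cap would support a model about 50 times larger than the 10T-parameter one, which is the sense in which the specific upper bound matters.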
        Noosphere89 20 Jul 2022 20:41 UTC
        1 point
        0
        Parent
        In short, while concerted effort could plausibly give us human intelligence, it is unlikely to go superhuman and FOOM.
        Not Relevant 20 Jul 2022 20:53 UTC
        1 point
        0
        Parent
        I wouldn’t go that far; using these systems to do recursive self-improvement via different learning paradigms (e.g. by designing simulators) could still get FOOM; it just seems less likely to me to happen by accident in the ordinary course of SSL training.