My attempt at putting numbers on the total data out there, for those curious:
* 64,000 Weibo posts per minute x ~500k minutes per year x 10 years = ~3T tokens. I’d guess there are at least 10 social media sites this size, but this is super-sensitive data sharded across competing actors, so unless it’s a CCP-led consortium I think upperbounding this at 10T tokens seems reasonable.
* Let’s say that all of social media is about a tenth of the text contributed to the internet. Then Google’s scrape is ~300T, assuming the internet of the last decade is substantially larger than that of the preceding decades.
Let’s assume Chinchilla scaling laws still provide the correct total quantity of data needed. Chinchilla scaling laws suggest ~200T tokens for a 10T-param model, so this does indeed seem like it’s in-range for Google (due either to their scrape or to YouTube) and maybe a CCP-led consortium.
(There is also obviously the possibility of collecting new data, or of building lots of simulators.)
(Unclear whether anyone else similarly scraped the internet, or whether enough of it is still intact for scrapers to go in afterward.)
10M papers per year x 10,000 tokens per paper x 30 years = 3T tokens.
You raise the possibility that data quality might be important and that maybe “papers/theses” are higher quality than Chinchilla scaling laws identified on The Pile; I don’t really have a good intuition here.
I spent a little while trying to find upload numbers for the other video platforms, to no avail. Per Wikipedia, Twitch is the 3rd largest worldwide video platform (though this doesn’t count apps, esp. TikTok/Instagram). Twitch has an average of 100,000 streams going on at any given times x 3e8 tokens per video-year (x maybe 5 years) = 100T tokens, similar to YouTube. So this does convince me that there are probably a few more entities with this much video data.
I agree that if you put enough of these together, there are probably ~10 actors that can scrape together >200T tokens. This is important for coordination; it means the number of organizations that can play at this level will be heavily bottlenecked, potentially for years (until a bunch more data can be generated, which won’t be free). It seems to me that these ~10 actors are large highly-legible entities that are already well-known to the US or Chinese governments. This could be a meaningful lever for mitigating the race-dynamic fear that “even if we don’t do it, someone else will”, reducing everything to a 2 party US-China negotiation.
The big problem is the cold war mentality is back, and both sides will compete a lot more rather than cooperate. Combine this with a bit of an arms race by China and the US, and the chances for cooperation on existential risk are remote.
This is a separate discussion, but it is important to point out that the literal Cold War had the opposing powers cooperate on existential risk reduction. Granted that before that, two cities were burned to ash and we played apocalypse chicken in Cuba.
The specific upper bound does matter if we’re worried about superintelligence. If easy-to-get data instead capped out at 10 quadrillion tokens, it’d be easy to blow past 10T-param models; if we conveniently threshold around human-level params, we might be more likely to be dealing with “fast parallel Von Neumanns” than a basilisk, at least initially.
Just to register a prediction: I would be very surprised if photos have anywhere near as much information content as text/video, given their relative lack of long-term causal structure.
I wouldn’t go that far; using these systems to do recursive self-improvement via different learning paradigms (e.g. by designing simulators) could still get FOOM; it just seems less likely to me to happen by accident in the ordinary coarse of SSL training.
My attempt at putting numbers on the total data out there, for those curious:
* 64,000 Weibo posts per minute x ~500k minutes per year x 10 years = ~3T tokens. I’d guess there are at least 10 social media sites this size, but this is super-sensitive data sharded across competing actors, so unless it’s a CCP-led consortium I think upperbounding this at 10T tokens seems reasonable.
* Let’s say that all of social media is about a tenth of the text contributed to the internet. Then Google’s scrape is ~300T, assuming the internet of the last decade is substantially larger than that of the preceding decades.
* 500 hours of video uploaded to YouTube every minute x ~500k minutes per year x 10 years x 3600 seconds per hour x (my guess of) 10 tokens per second = ~90T tokens on YouTube.
* O(100 million) books published ever x100,000 tokens per book = 10T tokens, roughly.
Let’s assume Chinchilla scaling laws still provide the correct total quantity of data needed. Chinchilla scaling laws suggest ~200T tokens for a 10T-param model, so this does indeed seem like it’s in-range for Google (due either to their scrape or to YouTube) and maybe a CCP-led consortium.
(There is also obviously the possibility of collecting new data, or of building lots of simulators.)
(Unclear whether anyone else similarly scraped the internet, or whether enough of it is still intact for scrapers to go in afterward.)
That’s still just 20T tokens.
10M papers per year x 10,000 tokens per paper x 30 years = 3T tokens.
You raise the possibility that data quality might be important and that maybe “papers/theses” are higher quality than Chinchilla scaling laws identified on The Pile; I don’t really have a good intuition here.
I spent a little while trying to find upload numbers for the other video platforms, to no avail. Per Wikipedia, Twitch is the 3rd largest worldwide video platform (though this doesn’t count apps, esp. TikTok/Instagram). Twitch has an average of 100,000 streams going on at any given times x 3e8 tokens per video-year (x maybe 5 years) = 100T tokens, similar to YouTube. So this does convince me that there are probably a few more entities with this much video data.
I agree that if you put enough of these together, there are probably ~10 actors that can scrape together >200T tokens. This is important for coordination; it means the number of organizations that can play at this level will be heavily bottlenecked, potentially for years (until a bunch more data can be generated, which won’t be free). It seems to me that these ~10 actors are large highly-legible entities that are already well-known to the US or Chinese governments. This could be a meaningful lever for mitigating the race-dynamic fear that “even if we don’t do it, someone else will”, reducing everything to a 2 party US-China negotiation.
The big problem is the cold war mentality is back, and both sides will compete a lot more rather than cooperate. Combine this with a bit of an arms race by China and the US, and the chances for cooperation on existential risk are remote.
This is a separate discussion, but it is important to point out that the literal Cold War had the opposing powers cooperate on existential risk reduction. Granted that before that, two cities were burned to ash and we played apocalypse chicken in Cuba.
Two more points:
The specific upper bound does matter if we’re worried about superintelligence. If easy-to-get data instead capped out at 10 quadrillion tokens, it’d be easy to blow past 10T-param models; if we conveniently threshold around human-level params, we might be more likely to be dealing with “fast parallel Von Neumanns” than a basilisk, at least initially.
Just to register a prediction: I would be very surprised if photos have anywhere near as much information content as text/video, given their relative lack of long-term causal structure.
In short, while concerted effort could plausibly give us human intelligence, it is likely not to go superhuman and FOOM.
I wouldn’t go that far; using these systems to do recursive self-improvement via different learning paradigms (e.g. by designing simulators) could still get FOOM; it just seems less likely to me to happen by accident in the ordinary coarse of SSL training.