Large models: the largest version of Minerva (540B parameters) does about 7% better than the 62B-parameter model. It seems like it would be relatively expensive to continue improving performance solely by scaling up (but see below on undertraining).
This seems to be a silver lining: the comments on my posts about 10T Chinchilla models estimate that these will not be economically feasible for at least 10 years. One thing I would like more insight into is whether all the AI ASIC startups can be safely ignored. Hardware companies usually exaggerate to the point of mendacity, and so far their record has been a string of startups that NVIDIA has completely out-competed. But I would like to know how much probability I should have on some AI ASIC startup releasing something that makes training a 10-100T Chinchilla model feasible in less than 5 years.
Some of my “timelines” intuitions are coming from my perception of the rate of advancement in the last five years, but I am wondering if we’ve already eaten the easy scaling gains and things might be slightly slower for the next few years.
Hardware has a few dark horses: photonics, spiking neural nets on analogue hardware, and quantum computing are all paradigms that look great in theory and continue to fail in practice… but we know that these are the sorts of things which fail right up until they succeed, so who knows?
You should also consider that with the experience curves of tech & algorithms, and potential for much larger budgets, estimates like $20b (present-day costs) are not that absurd. (Consider that Elon Musk is currently on track to light a >$20b pile of cash on fire because he made an ill-considered impulse purchase online a few months ago. Not to be outdone, MBS continues to set $500b on fire by not building some city in the desert; he also announced $1b/year for longevity research, which should at least do a little bit better in terms of results.)
Yeah, it’s not really clear how to apply that specific kind of data pruning (straightforward for an image classifier) to the case of causally modelling text tokens in full context windows or any other dense task like that.
The big open question to me is: how much information is actually out there? I’ve heard a lot of speculation that text is probably information-densest, followed by photos, followed by videos. I haven’t heard anything about video games; I could see the argument that games are denser than text (since games frequently require navigating a dynamically adversarial environment). But I also don’t know that I’d expect the ~millions of existing games to be all that independent of each other. (Being man-made, they’re a much less natural distribution than images, and they’ve all been generated for short-term human intelligence to solve.)
We also run into the data redundancy question: all the video on the internet contains one set of information, and all the text another set, but these sets have huge overlap. (This is why multimodal models with pretrained language backends are so weirdly data-efficient.) How much “extra” novelty exists in all the auxiliary sources of info?
My attempt at putting numbers on the total data out there, for those curious (rough arithmetic sketched in code after the list):
* 64,000 Weibo posts per minute x ~500k minutes per year x 10 years = ~3T tokens. I’d guess there are at least 10 social media sites this size, but this is super-sensitive data sharded across competing actors, so unless it’s a CCP-led consortium I think upper-bounding this at 10T tokens seems reasonable.
* Let’s say that all of social media is about a tenth of the text contributed to the internet. Then Google’s scrape is ~300T, assuming the internet of the last decade is substantially larger than that of the preceding decades.
* 500 hours of video uploaded to YouTube every minute x ~500k minutes per year x 10 years x 3600 seconds per hour x (my guess of) 10 tokens per second = ~90T tokens on YouTube.
* O(100 million) books published ever x 100,000 tokens per book = 10T tokens, roughly.
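Here’s a minimal back-of-envelope sketch of the arithmetic in the list above. The ~10 tokens per Weibo post is my own assumption, chosen so the ~3T figure works out; everything else is straight from the bullets.

```python
# Back-of-envelope token estimates from the bullets above.
# All figures are rough order-of-magnitude guesses, not measurements;
# the ~10 tokens per Weibo post is my own assumption.

MINUTES_PER_YEAR = 5e5  # ~500k minutes per year

# Weibo-scale social media: 64k posts/min over 10 years, assumed ~10 tokens/post
weibo_tokens = 64_000 * 10 * MINUTES_PER_YEAR * 10          # ~3e12 (~3T)

# ~10 sites of that size gives ~30T of social-media text;
# if that is ~1/10 of all internet text, a full scrape is ~300T.
all_social_media = 10 * weibo_tokens                         # ~3e13 (~30T)
full_scrape = 10 * all_social_media                          # ~3e14 (~300T)

# YouTube: 500 hours uploaded per minute over 10 years, ~10 tokens per second of video
youtube_tokens = 500 * MINUTES_PER_YEAR * 10 * 3600 * 10     # ~9e13 (~90T)

# Books: ~1e8 books ever published, ~1e5 tokens per book
book_tokens = 1e8 * 1e5                                      # ~1e13 (~10T)

for name, n in [("Weibo", weibo_tokens), ("Full scrape", full_scrape),
                ("YouTube", youtube_tokens), ("Books", book_tokens)]:
    print(f"{name}: ~{n:.0e} tokens")
```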
Let’s assume Chinchilla scaling laws still provide the correct total quantity of data needed. Chinchilla scaling laws suggest ~200T tokens for a 10T-param model, so this does indeed seem like it’s in-range for Google (due either to their scrape or to YouTube) and maybe a CCP-led consortium.
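The ~200T figure follows from the rough Chinchilla rule of thumb of ~20 training tokens per parameter:

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly 20 tokens per parameter.
TOKENS_PER_PARAM = 20
params = 10e12                                   # a 10T-parameter model
tokens_needed = TOKENS_PER_PARAM * params
print(f"~{tokens_needed:.0e} tokens needed")     # ~2e14, i.e. ~200T tokens
```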
(There is also obviously the possibility of collecting new data, or of building lots of simulators.)
(Unclear whether anyone else similarly scraped the internet, or whether enough of it is still intact for scrapers to go in afterward.)
That’s still just 20T tokens.
10M papers per year x 10,000 tokens per paper x 30 years = 3T tokens.
You raise the possibility that data quality might be important and that maybe “papers/theses” are higher quality than Chinchilla scaling laws identified on The Pile; I don’t really have a good intuition here.
I spent a little while trying to find upload numbers for the other video platforms, to no avail. Per Wikipedia, Twitch is the 3rd-largest worldwide video platform (though this doesn’t count apps, esp. TikTok/Instagram). Twitch has an average of 100,000 streams going at any given time x 3e8 tokens per video-year (x maybe 5 years) = ~100T tokens, similar to YouTube. So this does convince me that there are probably a few more entities with this much video data.
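For what it’s worth, the 3e8 tokens per video-year is consistent with the same ~10 tokens/second guess I used for YouTube; a quick sketch:

```python
# Twitch back-of-envelope, reusing the ~10 tokens/second guess from the YouTube estimate.
SECONDS_PER_YEAR = 3600 * 24 * 365
tokens_per_video_year = 10 * SECONDS_PER_YEAR        # ~3e8 tokens per video-year

# 100k streams live at any given time means ~100k video-years of footage per calendar year.
concurrent_streams = 100_000
years = 5
twitch_tokens = concurrent_streams * tokens_per_video_year * years
print(f"~{twitch_tokens:.1e} tokens")                # ~1.5e14, i.e. on the order of 100T
```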
I agree that if you put enough of these together, there are probably ~10 actors that can scrape together >200T tokens. This is important for coordination; it means the number of organizations that can play at this level will be heavily bottlenecked, potentially for years (until a bunch more data can be generated, which won’t be free). It seems to me that these ~10 actors are large, highly legible entities that are already well-known to the US or Chinese governments. This could be a meaningful lever for mitigating the race-dynamic fear that “even if we don’t do it, someone else will”, reducing everything to a two-party US-China negotiation.
The big problem is that the Cold War mentality is back, and both sides will compete a lot more rather than cooperate. Combine this with a bit of an arms race between China and the US, and the chances for cooperation on existential risk are remote.
This is a separate discussion, but it is important to point out that the literal Cold War saw the opposing powers cooperate on existential risk reduction. Granted, before that, two cities were burned to ash and we played apocalypse chicken in Cuba.
Two more points:
The specific upper bound does matter if we’re worried about superintelligence. If easy-to-get data instead capped out at 10 quadrillion tokens, it would be easy to blow past 10T-param models; if the data ceiling conveniently thresholds us around human-level parameter counts, we might be more likely to be dealing with “fast parallel Von Neumanns” than a basilisk, at least initially.
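To put a number on that, under the same ~20-tokens-per-parameter rule of thumb:

```python
# If easily available data capped out at ~1e16 tokens rather than ~2e14,
# the Chinchilla-optimal model would be far past 10T parameters.
TOKENS_PER_PARAM = 20
optimal_params = 1e16 / TOKENS_PER_PARAM
print(f"~{optimal_params:.0e} params")    # ~5e14, i.e. ~500T parameters
```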
Just to register a prediction: I would be very surprised if photos have anywhere near as much information content as text/video, given their relative lack of long-term causal structure.
In short, while concerted effort could plausibly give us human-level intelligence, it is likely not to go superhuman and FOOM.
I wouldn’t go that far; using these systems to do recursive self-improvement via different learning paradigms (e.g. by designing simulators) could still get FOOM; it just seems less likely to me to happen by accident in the ordinary course of SSL training.