People would ask things like “what would it cost (in compute spending) to train a 10T parameter Chinchilla?”, which is a bizarre way to frame things if you grok what Chinchilla is.
That wasn’t an alignment researcher, though (was it? I thought Tomás was just an interested commenter), and it’s a reasonable question to ask when no one’s run the numbers, and when you get an answer like ‘well, it’d take something like >5000x more compute than PaLM’, that’s a lesson learned.
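(For reference, that back-of-envelope is easy to reproduce from the usual C ≈ 6·N·D FLOPs approximation plus the ~20-tokens-per-parameter Chinchilla-optimal heuristic; the sketch below is my own, with PaLM's 540B parameters / ~780B training tokens as the only other inputs, so the exact multiple is approximate.)

```python
# Rough compute estimate for a Chinchilla-optimal 10T-parameter model vs. PaLM.
# Assumes the C ~= 6 * N * D FLOPs rule of thumb and ~20 training tokens per
# parameter for compute-optimal training (Hoffmann et al. 2022).

def train_flops(params, tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Hypothetical 10T-parameter Chinchilla-optimal run.
n_10t = 10e12
d_10t = 20 * n_10t                    # ~200T tokens
c_10t = train_flops(n_10t, d_10t)     # ~1.2e28 FLOPs

# PaLM: 540B parameters trained on ~780B tokens.
c_palm = train_flops(540e9, 780e9)    # ~2.5e24 FLOPs

print(f"10T Chinchilla: {c_10t:.1e} FLOPs")
print(f"PaLM:           {c_palm:.1e} FLOPs")
print(f"ratio:          ~{c_10t / c_palm:,.0f}x")
# Prints roughly 4,700-4,800x, i.e. the same ballpark as 'something like >5000x'.
```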
At least among the people I’ve talked to, it seems reasonably well understood that Chinchilla had major implications, meant an immediate capabilities jump and cheaper deployment, and even more importantly meant parameter scaling was dead, and data and then compute were the bottleneck (which is also what I’ve said bluntly in my earlier comments), and this was why Chinchilla was more important than more splashy stuff like PaLM*. (One capability researcher, incidentally, wasn’t revising plans, but that’s because he wasn’t convinced Chinchilla was right in the first place! AFAIK, there has been no dramatic followup to Chinchilla on par with GPT-3 following up Kaplan et al, and in fact, no one has replicated Chinchilla at all, much less run a full scaling law sweep and inferred similar scaling laws, so there is still some doubt about how real Chinchilla is or how accurate or generalizable its scaling laws are, quite aside from the usual issues like hilariously vague descriptions of datasets.)
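(To make the ‘parameter scaling is dead, data is the bottleneck’ point concrete: under the roughly C^0.73 / C^0.27 parameter/data split commonly quoted from Kaplan et al, extra compute went almost entirely into parameters, whereas Chinchilla’s roughly even C^0.5 / C^0.5 split means compute-optimal token counts grow with the square root of compute. A quick illustrative sketch of mine, treating those exponents as approximations:)

```python
# How compute-optimal parameter and token counts scale as the compute budget grows,
# under approximate Kaplan-et-al vs. Chinchilla exponents.
# N_opt ~ C^a, D_opt ~ C^b, with a + b ~= 1 (since C ~= 6 * N * D).

KAPLAN     = (0.73, 0.27)   # almost all extra compute -> more parameters
CHINCHILLA = (0.50, 0.50)   # split evenly -> token count grows ~sqrt(C)

growth = 100  # suppose a 100x larger compute budget
for name, (a, b) in [("Kaplan", KAPLAN), ("Chinchilla", CHINCHILLA)]:
    print(f"{name:10s}: 100x compute -> ~{growth**a:.0f}x params, ~{growth**b:.0f}x data")
# Kaplan    : 100x compute -> ~29x params, ~3x data
# Chinchilla: 100x compute -> ~10x params, ~10x data
```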
I also agree with Tom that if one had thoughts about Chinchilla and data sampling and brand-new scaling dynamics catapulting immediately into arms races, we are increasingly approaching the point where a reasonable person might decide to move discussions to more private channels, and for that reason the public discussions of Chinchilla might be very basic and of the ‘could we train a 10T parameter Chinchilla’ sort.
* PaLM and DALL-E 2 etc. helping drown out Chinchilla is an example of what I’ve referred to about how the boom-bust clustering of DL research publications can be quite harmful to discussions.
Yep. Just an interested layman.