Thanks so much for this update! Some quick questions:
1. Are you still estimating that the transformative model probably uses about 1e16 parameters & 1e16 FLOPs? IMO something more like 1e13 is more reasonable.
2. Are you still estimating that algorithmic efficiency doubles every 2.5 years (for now at least, until R&D acceleration kicks in)? I’ve heard from others (e.g. Jaime Sevilla) that more recent data suggests it’s doubling every 1 year currently.
3. Do you still update against the lower end of training FLOP requirements, on the grounds that if we were 1-4 OOMs away right now the world would look very different?
4. Is there an updated spreadsheet we can play around with?

Somehow you managed to be terrifying while only asking questions.
2. Are you still estimating that algorithmic efficiency doubles every 2.5 years (for now at least, until R&D acceleration kicks in)? I’ve heard from others (e.g. Jaime Sevilla) that more recent data suggests it’s doubling every 1 year currently.
It seems like the only source on this is Hernandez & Brown 2020. Their main finding is a doubling time of 16 months for AlexNet-level performance on ImageNet: “the number of floating point operations required to train a classifier to AlexNet-level performance on ImageNet has decreased by a factor of 44x between 2012 and 2019. This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years.”
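As a sanity check on that headline number (my own back-of-the-envelope arithmetic, not from the paper), a 44x gain over 2012-2019 does work out to roughly a 16-month doubling time:

```python
import math

# Back-of-the-envelope check of the Hernandez & Brown (2020) headline figure:
# a 44x drop in the training compute needed for AlexNet-level performance
# between 2012 and 2019.
efficiency_gain = 44                  # factor by which required FLOPs fell
months = (2019 - 2012) * 12           # 84 months

doublings = math.log2(efficiency_gain)   # ~5.46 doublings of efficiency
doubling_time = months / doublings       # ~15.4 months

print(f"{doublings:.2f} doublings over {months} months "
      f"-> one doubling every {doubling_time:.1f} months")
```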
They also find faster doubling times for some Transformer and RL systems (see the figure in their paper).
This is notably faster algorithmic progress than the 2.5-year doubling time used in Ajeya’s report, though I do somewhat agree with her justification for a more conservative estimate:
Additionally, it seems plausible to me that both sets of results would overestimate the pace of algorithmic progress on a transformative task, because they are both focusing on relatively narrow problems with simple, well-defined benchmarks that large groups of researchers could directly optimize. Because no one has trained a transformative model yet, to the extent that the computation required to train one is falling over time, it would have to happen via proxies rather than researchers directly optimizing that metric (e.g. perhaps architectural innovations that improve training efficiency for image classifiers or language models would translate to a transformative model). Additionally, it may be that halving the amount of computation required to train a transformative model would require making progress on multiple partially-independent sub-problems (e.g. vision and language and motor control).
I have attempted to take the Hernandez and Brown 2020 halving times (and Paul’s summary of the Grace 2013 halving times) as anchoring points and shade them upward to account for the considerations raised above. There is massive room for judgment in whether and how much to shade upward; I expect many readers will want to change my assumptions here, and some will believe it is more reasonable to shade downward.
Curious to read any other papers on this topic. More research benchmarking algorithmic gains seems tractable, and if anybody has a well-scoped question I might also be interested in doing that research.

Is there reason to believe algorithmic improvements follow an exponential curve? Do you happen to know a good source on this?
As opposed to what, linear? Or s-curvy? S-curves look exponential until you get close to the theoretical limit. I doubt we are close to the theoretical limits.
Ajeya bases her estimate on empirical data, so if you want to see whether it’s exponential, go look at that, I guess.
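To illustrate the s-curve point numerically (a minimal sketch with arbitrary made-up parameters, not anything from Ajeya’s data): a logistic curve tracks a pure exponential almost exactly until it nears its ceiling.

```python
import math

# Logistic curve with ceiling L, growth rate k, and midpoint t0, versus the
# exponential it approximates while still far below the ceiling.
L, k, t0 = 1.0, 1.0, 20.0            # arbitrary illustrative parameters

for t in [0, 5, 10, 15, 19]:
    logistic = L / (1 + math.exp(-k * (t - t0)))
    exponential = L * math.exp(k * (t - t0))
    print(f"t={t:2d}  logistic={logistic:.6f}  exponential={exponential:.6f}")

# The two columns agree closely until t approaches the midpoint t0, i.e. an
# early-stage s-curve is observationally ~indistinguishable from an exponential.
```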