> it’s estimated that the efficiency of algorithms has improved about 3x/year
There was about a 5x increase since GPT-3 for dense transformers (see Figure 4), and then there’s MoE on top of that, so assuming GPT-3 is not much better than the 2017 baseline (once anyone seriously bothered to optimize it), it’s more like 30% per year, though plausibly slower recently.
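To make the arithmetic explicit, here is a minimal sketch of the rate conversion; the ~6x total multiplier for 2017–2024 (5x dense plus a modest assumed MoE factor) is an illustrative assumption, not a sourced figure:

```python
def annual_rate(total_multiplier: float, years: float) -> float:
    """Annualized growth rate implied by a total efficiency multiplier."""
    return total_multiplier ** (1 / years) - 1

# Assumption: ~6x total compute-efficiency gain over 2017-2024 (7 years),
# i.e. 5x for dense transformers plus a modest extra factor from MoE.
print(f"{annual_rate(6.0, 7):.0%}/year")  # -> 29%/year, i.e. "about 30%"
```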
The relevant Epoch paper says the point estimate for the compute-efficiency doubling time is 8-9 months (Section 3.1, Appendix G), about 2.5x/year. Though I can’t make sense of their methodology, which aims to compare the incomparable. In particular, what good is comparing even transformers without following the Chinchilla protocol (finding minima on isoFLOP plots of training runs with individually optimal learning rates, not continued pre-training with suboptimal learning rates at many points)? Not to mention non-transformers, where the scaling laws won’t match, so the result of the comparison changes as we vary the scale; and many older algorithms probably won’t scale to arbitrarily large compute at all.
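For reference, a minimal sketch of the isoFLOP protocol I mean, using the published Chinchilla parametric fit (Hoffmann et al., 2022) as a stand-in loss surface; the constants and the C ≈ 6ND approximation are that paper’s, and the rest is illustrative:

```python
import numpy as np

# Chinchilla-style isoFLOP protocol: at a fixed training budget C, sweep
# model size N, set tokens D = C / (6N), and take the N minimizing loss.
# The parametric loss L(N, D) = E + A/N^alpha + B/D^beta and its constants
# are the published Chinchilla fit; treating it as the ground-truth loss
# surface here is an illustrative assumption.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def isoflop_minimum(C):
    """Return (optimal N, minimal loss) on the isoFLOP curve for budget C."""
    N = np.logspace(7, 12, 4000)   # candidate model sizes (parameters)
    D = C / (6 * N)                # tokens, from C ~ 6*N*D training FLOPs
    L = loss(N, D)
    i = int(np.argmin(L))
    return N[i], L[i]

for C in (1e21, 1e23, 1e25):
    N_opt, L_min = isoflop_minimum(C)
    print(f"C={C:.0e} FLOPs: N*~{N_opt:.1e} params, loss~{L_min:.3f}")
```

Run the same sweep against a second model family with different fitted exponents and the implied efficiency ratio between the two families changes with C, which is exactly the scale-dependence problem above.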
(With JavaScript mostly disabled, the page you linked lists “Compute-efficiency in language models” as 5.1%/year (!!!). After JavaScript is sufficiently enabled, it starts saying “3 ÷/year”, with a ‘÷’ character, though “90% confidence interval: 2 times to 6 times” disambiguates it. In other places on the same page there are figures like “2.4 x/year” with the more standard ‘x’ character for this meaning.)