The opportunities for algorithmic improvements go far beyond the parallelization and mixture of experts methods you mention.
I agree. I’d be very interested in anyone’s forecasts for how they might evolve.
I’ve been working with (very roughly) another ~10x or so improvement in “inference efficiency” by 2030 (though I’m unsure how best to measure this and keep it independent of other factors).
By this I mean that if a model trained with 10^26 FLOP this year, at a fixed level of learning efficiency, requires 10X FLOP to generate some amount of useful output, then by 2030 it would require only X FLOP to get the same output.
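As a toy numerical sketch of that definition (all figures here are hypothetical placeholders, not forecasts):

```python
def total_inference_flop(flop_per_output: float, n_outputs: float) -> float:
    """Total inference compute needed for a fixed quantity of useful output."""
    return flop_per_output * n_outputs

X = 1e9                      # hypothetical per-output inference cost in 2030 (FLOP)
n = 1e6                      # fixed amount of useful output to generate

cost_now = total_inference_flop(10 * X, n)   # today: 10X FLOP per unit of output
cost_2030 = total_inference_flop(X, n)       # 2030: X FLOP for the same output

print(cost_now / cost_2030)  # the assumed ~10x inference-efficiency gain
```

The training budget (10^26 FLOP) and model capability are held fixed throughout; only the per-output inference cost changes, which is what keeps the metric separable from training-efficiency gains.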