Thanks, that’s interesting!
Can I double-check: do you think this affects the bottom line?
The bottom line is supposed to be that FLOP/s vs. FLOP per forward pass gives an upper bound, memory bandwidth vs. model size gives a lower bound, and real-life efficiency falls somewhere in between, depending on many factors (including the length of the KV cache) that I don’t try to get into, but plausibly around 15% of the upper bound for GPT-4 on H100s.
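For concreteness, this is the kind of back-of-the-envelope calculation I have in mind; all numbers are illustrative (rough H100 specs and a placeholder parameter count, not actual GPT-4 figures):

```python
# Toy upper/lower bounds on decode throughput for a single GPU.
# All numbers are illustrative assumptions, not real GPT-4 specs.

PEAK_FLOPS = 989e12       # H100 SXM, dense BF16, roughly 989 TFLOP/s
MEM_BANDWIDTH = 3.35e12   # H100 HBM3, roughly 3.35 TB/s
N_PARAMS = 280e9          # placeholder active-parameter count
BYTES_PER_PARAM = 2       # 16-bit weights

flop_per_token = 2 * N_PARAMS  # standard ~2N FLOP-per-token approximation

# Upper bound: every available FLOP goes into forward passes.
upper = PEAK_FLOPS / flop_per_token
# Lower bound: batch size 1, so every token must stream all the weights once.
lower = MEM_BANDWIDTH / (N_PARAMS * BYTES_PER_PARAM)

print(f"upper bound:  {upper:,.0f} tokens/s")          # ~1,800
print(f"lower bound:  {lower:,.0f} tokens/s")           # ~6
print(f"15% of upper: {0.15 * upper:,.0f} tokens/s")    # ~260
```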
Are you saying that the lower bound for output tokens should maybe be even lower, because the KV cache can be larger than the model weights?
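If so, I’d picture it something like the sketch below, where at batch size 1 each decoded token streams the weights plus the whole KV cache; every number is a made-up placeholder, and it ignores that such a model wouldn’t fit on a single GPU:

```python
# Extending the toy floor above: at batch size 1, each decoded token streams the
# weights AND the whole KV cache, so once the cache outgrows the weights the
# floor more than halves. Every number here is a placeholder.

MEM_BANDWIDTH = 3.35e12    # H100 HBM3, roughly 3.35 TB/s
WEIGHT_BYTES = 280e9 * 2   # same placeholder model as above, 16-bit weights

KV_BYTES_PER_TOKEN = 5e6   # assumed KV-cache bytes per token of context
CONTEXT_LEN = 128_000

kv_bytes = CONTEXT_LEN * KV_BYTES_PER_TOKEN   # ~0.64 TB, bigger than the weights
bytes_per_token = WEIGHT_BYTES + kv_bytes     # memory traffic per decoded token

floor = MEM_BANDWIDTH / bytes_per_token
print(f"floor with KV cache: {floor:.1f} tokens/s")  # ~2.8, vs ~6 weights-only
```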
I agree. I’d be very interested in anyone’s forecasts for how they might evolve.
I’ve been working with (very roughly) another ~10x improvement in “inference efficiency” by 2030 (or however one should measure this while keeping it independent of other factors).
By this I mean that if we were able to train a model with 10^26 FLOP this year, at a fixed level of learning efficiency, it would require 10X FLOP to generate a given amount of useful output, while by 2030 it would require only X FLOP to get the same output.
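As a toy restatement of that definition (all values are arbitrary placeholders):

```python
# Toy restatement of the ~10x "inference efficiency" definition: same training
# compute, learning efficiency held fixed, less inference FLOP per unit of
# useful output. All values are arbitrary placeholders.

TRAIN_FLOP = 1e26            # fixed training budget in both years

flop_per_output_2025 = 10.0  # "10X" in the comment, arbitrary units
flop_per_output_2030 = 1.0   # "X" by 2030

improvement = flop_per_output_2025 / flop_per_output_2030
print(f"training compute held fixed at {TRAIN_FLOP:.0e} FLOP")
print(f"implied inference-efficiency improvement by 2030: ~{improvement:.0f}x")
```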