Thanks for sharing your thoughts. As you already outlined, the report mentions on several occasions that the hardware forecasts are the least informed:
“Because they have not been the primary focus of my research, I consider these estimates unusually unstable, and expect that talking to a hardware expert could easily change my mind.”
This is partly why I started looking into this a couple of months ago, and I'm still doing so on the side.
A couple of points come to mind:
I discuss the compute estimate side of the report a bit in my TAI and Compute series. My baseline is that I agree with your caveats and list some of the same plots. However, I also go into some reasons why those plots might not be that informative for the metric we care about.
Many compute trend plots assume peak performance based on the spec sheet or a specific benchmark (Graph500). This does not translate 1:1 into "AI computing capabilities" (let's refer to these as effective FLOPs).
See the discussion of utilization in our piece on estimating training compute, and my rant about it in the appendix of TAI and Compute.
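To make the utilization point concrete, here is a minimal sketch of the hardware-based estimate (Python; the function name and all numbers are illustrative assumptions of mine, not taken from the piece):

```python
# Hardware-based training compute estimate:
# effective FLOP = peak FLOP/s per chip * #chips * utilization * training time.
def training_compute(peak_flops_per_chip, n_chips, utilization, seconds):
    """Effective FLOP actually delivered during a training run."""
    return peak_flops_per_chip * n_chips * utilization * seconds

# Illustrative run: 1,000 chips at a 3e14 FLOP/s spec-sheet peak for 30 days.
seconds = 30 * 86_400
naive = training_compute(3e14, 1_000, 1.0, seconds)  # spec sheet, 100% utilization
real = training_compute(3e14, 1_000, 0.3, seconds)   # ~30% utilization, a commonly cited figure for large runs
print(f"naive: {naive:.1e} FLOP vs. effective: {real:.1e} FLOP")
```

In this made-up example the spec-sheet number overstates delivered compute by more than 3x, which is exactly the gap between the plotted trends and effective FLOPs.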
I think the same caveat applies to the TOP500. I'd be interested in a Graph500 trend over time (Graph500 is more about communication than pure processing capabilities). Anecdotally, EleutherAI explicitly said that the interconnect was their bottleneck for training GPT-NeoX-20B.
Note that all of these reports and graphs usually refer to performance. Ultimately, though, we're interested in FLOPs/$.
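As a toy illustration of how performance and price-performance can come apart (all numbers made up for the example):

```python
# FLOP/$ over a chip's lifetime: delivered FLOP divided by purchase price.
def flops_per_dollar(peak_flops, utilization, lifetime_seconds, price_usd):
    return peak_flops * utilization * lifetime_seconds / price_usd

lifetime = 3 * 365 * 86_400  # assume 3 years of use
gen_a = flops_per_dollar(1.0e14, 0.3, lifetime, 10_000)
gen_b = flops_per_dollar(1.2e14, 0.3, lifetime, 6_000)
print(f"gen A: {gen_a:.1e} FLOP/$, gen B: {gen_b:.1e} FLOP/$")  # ~2x improvement
```

Peak performance only improved 1.2x between the hypothetical generations, but FLOP/$ doubled because the price dropped; a performance-only plot would miss most of the improvement.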
What do you think about hardware getting cheaper? I summarize Cotra’s point here.
I don't have a strong view here, only a "yeah, seems plausible to me".
Overall, there will either be room for improvement in chip design, or chip design will stabilize, which enables the economies of scale (learning curves) outlined above. Consequently, even if you believe that technological progress (more performance for the same price) might halt, compute costs should continue to decrease, as chips then get cheaper (the same performance for a lower price).
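A minimal sketch of that learning-curve argument (Wright's law; the 20% learning rate is a standard illustrative value, not a measured number for chips):

```python
import math

# Wright's law: unit cost falls by a fixed fraction (the "learning rate")
# with every doubling of cumulative production.
def unit_cost(initial_cost, cumulative_units, initial_units, learning_rate=0.2):
    doublings = math.log2(cumulative_units / initial_units)
    return initial_cost * (1 - learning_rate) ** doublings

# With zero design progress, 8x more cumulative production (3 doublings)
# at a 20% learning rate roughly halves the price of the same chip:
print(unit_cost(10_000, 8, 1))  # 5120.0
```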
Overall, I think you're saying something like "this can't go on, and the trend has already slowed down". While I agree that you're pointing to important trends, I'm somewhat optimistic that other hardware trends might be able to continue driving progress in effective FLOPs, e.g., most recently the interconnect (networking multiple GPUs together into clusters). I think a more rigorous analysis of the last 10 years could already give some insight into which components have been the drivers of growth in effective FLOPs.
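The kind of decomposition I have in mind, as a rough sketch (every number below is a hypothetical placeholder, not a measurement):

```python
# Decompose the growth in a training cluster's effective FLOP/s into its
# drivers: per-chip peak performance, cluster size, and achieved utilization.
then = {"peak_per_chip": 1e13, "n_chips": 100,   "utilization": 0.40}
now  = {"peak_per_chip": 3e14, "n_chips": 5_000, "utilization": 0.30}

def effective_flops(cluster):
    return cluster["peak_per_chip"] * cluster["n_chips"] * cluster["utilization"]

for key in then:
    print(f"{key}: {now[key] / then[key]:.2f}x")
print(f"total: {effective_flops(now) / effective_flops(then):.0f}x")
```

In this made-up example, scaling out (more chips, i.e., better interconnect) contributes more than per-chip progress, while utilization actually declines; doing this with real data is what I'd like to see.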
For this reason, I'm pretty excited about MLCommons benchmarks, or something like LambdaLabs' benchmarks: measuring the performance we actually care about for AI.
Lastly, I’m working on better compute cost estimates and hoping to have something out in the next couple of months.