This is fantastic. Really appreciate both the detailed deep-dive in the document and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven’t generally been too forthcoming with compute estimates. (There are exceptions.)
As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be:
Mixture models (or any kind of parameter-sharing, really) for the first method, which will cause you to systematically overestimate the “Operations per forward pass” factor; and
Variable effective utilization rates of custom hardware for the second method, which will cause an unknown distribution of errors in the “utilization rate” factor. (Both factors are illustrated in the sketch after this list.)
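To make the two failure modes concrete, here is a minimal sketch in Python of the two estimation routes and where each fudge factor enters. The 6 · params · tokens approximation is the standard one for dense transformers; the function names, the `active_fraction` knob, and the GPT-3-flavored numbers in the usage example are my own illustrative assumptions, not figures from the post.

```python
# Minimal sketch, assuming the standard ~6 * params * tokens approximation
# for dense transformer training compute. All function names and the numbers
# in the usage example below are illustrative, not taken from the post.

def compute_from_architecture(n_params: float, n_tokens: float,
                              active_fraction: float = 1.0) -> float:
    """Method 1: training FLOPs ~= 6 * (active params) * (training tokens).

    In a mixture-of-experts model only a fraction of the parameters is
    active on any forward pass; leaving active_fraction at 1.0 (i.e.
    counting all parameters) systematically overestimates the
    operations-per-forward-pass factor.
    """
    return 6.0 * n_params * active_fraction * n_tokens


def compute_from_hardware(n_chips: int, peak_flops_per_chip: float,
                          training_seconds: float,
                          utilization: float) -> float:
    """Method 2: training FLOPs ~= chips * peak FLOP/s * wall time * utilization.

    Utilization is the soft factor here: it depends on the hardware, the
    parallelism strategy, and implementation quality, so its errors don't
    have a known sign or distribution.
    """
    return n_chips * peak_flops_per_chip * training_seconds * utilization


# Rough GPT-3-flavored inputs (public ballpark figures, purely illustrative):
# 175B params, ~300B training tokens; ~10,000 V100s (125 TFLOP/s FP16 peak)
# running for ~15 days at 20% utilization.
arch_est = compute_from_architecture(175e9, 300e9)                   # ~3.1e23
hw_est = compute_from_hardware(10_000, 125e12, 15 * 24 * 3600, 0.2)  # ~3.2e23
print(f"architecture: {arch_est:.2e} FLOPs, hardware: {hw_est:.2e} FLOPs")
```

With those (admittedly convenient) inputs the two routes agree well within a factor of 2, which is roughly the level of agreement I’d expect in practice.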
Nonetheless, my off-the-cuff guess would be that your method is all but guaranteed to be right to within an OOM, and probably within a factor of 2 or less. That seems pretty good! It’s certainly an improvement over anything I’ve seen previously along these lines. Congrats!