It seems like you probably could have gotten certainty about compute for at least a handful of the models studied in question
We thought so too, but in practice it has been surprisingly hard; the profilers we tried turned out to be buggy. Our colleague Marious looked into this in more depth here.
Maybe we are just going about it the wrong way. If someone here figures out how to directly measure compute in e.g. a PyTorch or TensorFlow model, it would be a huge boon to us.
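For what it's worth, here is a minimal sketch of the kind of bookkeeping a profiler automates: hand-counting the multiply-accumulates in a toy MLP. The layer sizes and the backward-pass factor below are illustrative assumptions, not measurements from any real model.

```python
# Hand-counting FLOPs for a toy MLP: a sketch of what operation counting
# looks like, not a substitute for an empirical profiler measurement.

def mlp_forward_flops(layer_sizes, batch_size):
    """FLOPs for one forward pass. Each linear layer of shape (n_in, n_out)
    performs n_in * n_out multiply-accumulates per example,
    i.e. 2 * n_in * n_out FLOPs (ignoring biases and activations)."""
    flops = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        flops += 2 * n_in * n_out * batch_size
    return flops

def training_step_flops(layer_sizes, batch_size, backward_factor=2):
    # The backward pass is commonly approximated as ~2x the cost of the
    # forward pass, making a full training step ~3x forward. This factor
    # is a rule of thumb, not an exact count.
    return (1 + backward_factor) * mlp_forward_flops(layer_sizes, batch_size)

print(mlp_forward_flops([784, 256, 10], batch_size=32))   # forward-pass FLOPs
print(training_step_flops([784, 256, 10], batch_size=32)) # full-step estimate
```

The gap between estimates like this and what a profiler reports for the same model is exactly where the discrepancies we ran into show up.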
I think two more contemporary techniques are worth considering here: structured sparsity in weights (‘blocksparse’), and mixture-of-experts gating (‘switch transformer’)
Great suggestions! I think those would be great caveats to look into in future work.
I’d be curious to hear the authors’ expectations of how this research changes in the face of more custom ML hardware.
My naive impression is that our conclusions would not change much. You would just need to plug the effective performance (peak performance × utilization) into the second formula.
Probably the trickiest part is figuring out the utilization rate for the custom hardware, though that is a general problem with the second method.
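To make the second method concrete, here is a small sketch of the estimate it produces. All of the input numbers below are illustrative placeholders, not figures from any real training run or piece of hardware.

```python
# Hardware-based estimate of training compute (the "second method"):
#   compute ≈ training time × number of chips × peak FLOP/s × utilization
# Every input here is an illustrative assumption.

def estimate_training_compute(training_days, num_chips,
                              peak_flops_per_sec, utilization):
    """Return total training compute in FLOP."""
    seconds = training_days * 24 * 3600
    effective_flops = peak_flops_per_sec * utilization  # effective performance
    return seconds * num_chips * effective_flops

# Hypothetical example: 10 days on 64 chips with 312 TFLOP/s peak
# performance at 30% utilization.
compute = estimate_training_compute(
    training_days=10,
    num_chips=64,
    peak_flops_per_sec=312e12,
    utilization=0.30,
)
print(f"{compute:.2e} FLOP")  # → 5.18e+21 FLOP
```

Note how linearly the result depends on the utilization input: halving it halves the estimate, which is why pinning down utilization for custom hardware matters so much.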
In general, I think it would be good to integrate the many publicly available performance benchmarks into calibrations for this method, since hardware providers are usually eager to publish stats that make their hardware look good.
I think that would be nice! We started a public spreadsheet with some info on different hardware. This might be of help to someone who wants to dig deeper into the topic!
Thank you Alex! You make some great points.