I want to briefly preface this post by saying that this is my first on LW and I have no professional background in machine learning—I monitor the AI page as a sort of early-warning radar for AGI—so bear that in mind and feel free to correct any errors here.
With that out of the way, I did a back-of-the-envelope calculation for the amount of compute used to train the model, based on the numbers given in the NVIDIA post:
280 DGX A100 servers x 8 GPUs/server x 126 teraFLOP/s per GPU x 60.1 seconds/batch = 1.70 x 10^7 teraFLOP/batch = 17,000 petaFLOP/batch
271 billion tokens / 2048 tokens/sequence / 1920 sequences/batch = 68,919 batches
17,000 petaFLOP/batch x 68,919 batches = 1.17 x 10^9 petaFLOP
1.17 x 10^9 petaFLOP = 1.17 x 10^9 petaFLOP/s-seconds, or ~13,542 petaFLOP/s-days (dividing by 86,400 seconds/day)
That’s ~3.72 times the 3,640 petaFLOP/s-days used to train GPT-3, for ~3.03 times the parameters (530B vs. 175B), so compute cost seems to be scaling slightly supralinearly with parameter count, although I’m not sure how much I’d read into that given the different architectures, token counts, etc.
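In case it helps anyone check the arithmetic, here is the same calculation as a short Python sketch. The figures are just the ones quoted above (from the NVIDIA post and the GPT-3 paper); the variable names are my own.

```python
# Back-of-the-envelope training-compute estimate, using the figures quoted
# above from the NVIDIA post and the GPT-3 paper. Variable names are my own.

SECONDS_PER_DAY = 86_400

# Compute per batch
servers = 280                  # DGX A100 servers
gpus_per_server = 8
tflops_per_gpu = 126           # achieved teraFLOP/s per GPU
seconds_per_batch = 60.1

petaflop_per_batch = servers * gpus_per_server * tflops_per_gpu * seconds_per_batch / 1_000
# ~16,963 petaFLOP/batch (rounded to 17,000 above)

# Number of batches
total_tokens = 271e9
tokens_per_sequence = 2048
sequences_per_batch = 1920
batches = total_tokens / tokens_per_sequence / sequences_per_batch  # ~68,919

# Total training compute
total_petaflop = petaflop_per_batch * batches        # ~1.17e9 petaFLOP
petaflop_s_days = total_petaflop / SECONDS_PER_DAY   # ~13,500 petaFLOP/s-days

# Comparison with GPT-3
gpt3_petaflop_s_days = 3_640
compute_ratio = petaflop_s_days / gpt3_petaflop_s_days  # ~3.7x
param_ratio = 530e9 / 175e9                              # ~3.0x

print(f"Compute per batch: {petaflop_per_batch:,.0f} petaFLOP")
print(f"Batches: {batches:,.0f}")
print(f"Total: {total_petaflop:.2e} petaFLOP = {petaflop_s_days:,.0f} petaFLOP/s-days")
print(f"Compute ratio vs GPT-3: {compute_ratio:.2f}x (parameter ratio: {param_ratio:.2f}x)")
```

(Using the unrounded per-batch figure gives closer to ~13,500 petaFLOP/s-days; the ~13,542 above comes from the rounded 17,000 petaFLOP/batch.)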
Aside from the number of parameters, there doesn’t seem to be much novelty here; it’s just a scaled-up version of previous language models. It’s not multimodal either, despite multimodality seeming to be the direction the field is moving in.
My tentative take on this is that it’s more of a hardware showcase for NVIDIA than an attempt to make a bold leap forward in deep learning. NVIDIA gets to show off the power of their tech to train huge neural nets, and Microsoft gets a modestly more powerful language model to work with.
As I said before, please let me know if there are any mistakes in my calculation or if you disagree with my assessment—I’d be interested to get feedback.