avturchin comments on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

avturchin 12 Oct 2021 15:25 UTC
2 points
It was not Google but Microsoft. In September 2020 they wrote: “The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048.” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
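As a rough sanity check on the "trillion-parameter" label, the quoted layer count and hidden dimension can be plugged into the common rule of thumb that a decoder-only transformer has about 12 · layers · hidden² weights. This is only a sketch: the rule of thumb is an assumption here (the blog post does not state how it counts parameters), it ignores embeddings, biases, and layer norms, and the function name is made up for illustration.

```python
def approx_transformer_params(num_layers: int, hidden_dim: int) -> int:
    """Rule-of-thumb weight count for a stack of transformer layers.

    Assumes each layer has ~4*h^2 attention weights (Q, K, V, output
    projections) plus ~8*h^2 MLP weights (two h x 4h matrices).
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * num_layers * hidden_dim ** 2


# Figures taken from the quoted sentence: 298 layers, hidden dimension 17,408.
n_params = approx_transformer_params(num_layers=298, hidden_dim=17_408)
print(f"{n_params:,}")  # 1,083,665,547,264, i.e. about 1.08 trillion
```

Under those assumptions the quoted configuration comes out slightly above one trillion weights, which is consistent with how the blog post describes it.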
- gwern 12 Oct 2021 17:06 UTC
  14 points
  That was just the tech demo. It obviously wasn’t actually trained, just demoed for a few steps to show that the code works. If they had trained it back then, they’d, well, have announced it like OP! OP is what they’ve trained for-reals after further fiddling with that code.