We're missing a GPT-4 data point for this chart. The new Megatron is more like one more replication of GPT-3, and so is the new Chinese 245-billion-parameter model. But Google had a trillion-parameter model in fall 2020.
A better citation for the Chinese 245B model: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.
That’s new to me. Any citation for this?
https://arxiv.org/abs/2101.03961 (Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity)
It was not Google but Microsoft: in September 2020 they wrote, "The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048." https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
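(As a back-of-envelope check, assuming the usual ~12·d² parameters per standard Transformer layer and ignoring embeddings and biases, those figures do come out to roughly a trillion parameters:)

    # Rough parameter count from the numbers quoted in the DeepSpeed post
    layers = 298
    d_model = 17_408
    params_per_layer = 12 * d_model ** 2   # attention + MLP weights, ~3.64e9 per layer
    total = layers * params_per_layer      # ~1.08e12, i.e. ~1.1 trillion
    print(f"{total / 1e12:.2f} trillion parameters")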
That was just a tech demo. It obviously wasn't actually trained, just run for a few steps to show that the code works. If they had trained it back then, they'd, well, have announced it like OP! OP is what they've trained for real after further fiddling with that code.