If you just look at models before GPT-3, the trend line you’d draw is still noticeably steeper than the actual line on the graph. (ELMo and BERT-Large are below trend while T5 and Megatron 8.3B are above.) The new Megatron would represent the biggest trend-line undershoot.
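To make “noticeably steeper” concrete, here’s a minimal sketch of fitting a log-linear trend to the pre-GPT-3 models and extrapolating to the new Megatron’s release; the release dates and parameter counts below are my own rough figures, not the chart’s underlying data:

```python
import numpy as np

# Approximate release dates (decimal years) and parameter counts;
# rough figures I filled in, not the chart's actual data.
models = {
    "ELMo":          (2018.1, 94e6),
    "BERT-Large":    (2018.8, 340e6),
    "GPT-2":         (2019.1, 1.5e9),
    "Megatron 8.3B": (2019.6, 8.3e9),
    "T5":            (2019.8, 11e9),
}

years = np.array([v[0] for v in models.values()])
log_params = np.log10([v[1] for v in models.values()])

# Least-squares fit of log10(params) against time, then extrapolate
# the pre-GPT-3 trend to the new Megatron's release in late 2021.
slope, intercept = np.polyfit(years, log_params, 1)
predicted = 10 ** (slope * 2021.8 + intercept)

print(f"Trend slope: ~{slope:.1f} OOM/year")
print(f"Trend predicts ~{predicted:.1e} params in late 2021; actual is 5.3e11")
```

On these rough numbers the pre-GPT-3 trend predicts several trillion parameters by late 2021, so a 530B model undershoots it by close to an order of magnitude.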
Also, I think any post-COVID speedup will be more than drowned out by the recent slowdown in the rate at which compute prices fall. They were dropping by an OOM every 4 years, but now it’s every 10-16 years.
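For concreteness, the two regimes diverge fast; a back-of-the-envelope sketch using the OOM-every-4-years and OOM-every-10-16-years figures above:

```python
# How much cheaper compute gets over a decade under the old vs. new
# price-decline regimes quoted above.

def price_factor(years: float, years_per_oom: float) -> float:
    """Multiplicative drop in compute price after `years`,
    assuming one order of magnitude per `years_per_oom` years."""
    return 10 ** (years / years_per_oom)

decade = 10
print(f"Old regime (OOM every 4 yr):  {price_factor(decade, 4):.0f}x cheaper")
print(f"New regime (OOM every 13 yr): {price_factor(decade, 13):.1f}x cheaper")
# Old regime: ~316x cheaper per decade; new regime (taking the
# midpoint of 10-16 years): only ~5.9x cheaper.
```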
We’re missing a GPT-4 data point for this chart. The new Megatron is more like one more replication of GPT-3, as is the new Chinese 245-billion-parameter model. But Google had a trillion-parameter model in fall 2020.
A better citation for the Chinese 245B model: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.
That’s new to me. Any citation for this?
https://arxiv.org/abs/2101.03961
It was not Google but Microsoft; in September 2020 they wrote: “The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048.” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
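The quoted configuration does pencil out to roughly a trillion parameters under the standard ~12·L·d² approximation for a Transformer with L layers and hidden dimension d (attention projections contribute 4d² per layer, the feed-forward block 8d²; embeddings ignored):

```python
# Sanity check on the quoted configuration: 298 layers, hidden dim 17,408.
# Per layer: 4*d^2 (Q/K/V/output projections) + 8*d^2 (feed-forward) = 12*d^2.
layers, hidden = 298, 17408
params = 12 * layers * hidden ** 2
print(f"~{params / 1e12:.2f} trillion parameters")  # ~1.08 trillion
```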
That was just the tech demo. It obviously wasn’t actually trained, just demoed for a few steps to show that the code works. If they had trained it back then, they’d, well, have announced it like OP! OP is what they’ve trained for real after further fiddling with that code.
I’m not sure how relevant the slowdown in compute price decreases is to this chart: the chart starts in 2018, while the slowdown began 6-8 years ago; likewise, AlexNet, the breakout moment for deep learning, was 9 years ago. So if compute price were the primary rate-limiter, I’d expect a more gradual, consistent effect as models get bigger and bigger. The slowdown may mean that models cost quite a lot to train, but huge companies like Nvidia and Microsoft clearly haven’t started shying away from spending absurd amounts of money to keep growing their models.