This seems to reflect a noticeable slowdown in the rate at which language model sizes increase. Compare the trend line you’d draw through the prior points to the one on the graph.
I’m still disappointed at the limited context window (2048 tokens). If you’re going to spend millions on training a transformer, you may as well make it one of the linear-time-complexity variants.
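For intuition on why that 2048 cap matters: vanilla softmax attention computes a score for every (query, key) pair, so cost blows up quadratically with context length, while the linear-attention variants (Performer-style kernel feature maps and the like) scale linearly. A toy sketch of the scaling, where `feature_dim` is just an illustrative placeholder, not any particular model’s setting:

```python
# Toy comparison of attention cost scaling; numbers are unitless "score
# computations", not a benchmark of any real implementation.

def full_attention_scores(seq_len: int, n_heads: int = 1) -> int:
    # Vanilla attention: one score per (query, key) pair per head.
    return n_heads * seq_len * seq_len

def linear_attention_cost(seq_len: int, feature_dim: int = 256) -> int:
    # Kernelized variants accumulate a fixed-size summary per position
    # instead of comparing all pairs: O(seq_len * feature_dim).
    return seq_len * feature_dim

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {full_attention_scores(n):>13,} quadratic"
          f" vs {linear_attention_cost(n):>10,} linear")
```

Even at 2048 tokens the quadratic term is already 8x the linear one (with this placeholder `feature_dim`), and the ratio grows with every added token.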
It looks like Turing NLG models are autoregressive generative, like GPTs. So, not good at things like rephrasing text sections based on bidirectional context, but good at unidirectional language generation. I’m confused as to why everyone is focusing on unidirectional models. It seems like, if you want to provide a distinct service compared to your competition, bidirectionality would be the way to go. Then your model would be much better at things like evaluating text content, rephrasing, or grammar checking. Maybe the researchers want to be able to compare results with prior work?
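At the architecture level, the unidirectional/bidirectional difference mostly comes down to the attention mask. A minimal numpy sketch (not any real model’s code):

```python
import numpy as np

# Causal (GPT-style) vs. bidirectional (BERT-style) attention masks.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len)))  # no peeking at future tokens
bidirectional_mask = np.ones((seq_len, seq_len))    # every token sees every other

print(causal_mask)
# Row i is nonzero only up to column i: position i conditions on the past
# alone, which is why rephrasing a span in light of the words *after* it
# is awkward for an autoregressive model.
```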
I’d hesitate to make predictions based on the slowdown of GPT-3 to Megatron-Turing, for two reasons.
First, GPT-3 represents the fastest, largest increase in model size in this whole chart. If you only look at the models before GPT-3, the drawn trend line tracks well. Note how far off the trend GPT-3 itself is.
Second, GPT-3 was released almost exactly when COVID became a serious concern in the world beyond China. I must imagine that this slowed down model development, but it will be less of a factor going forward.
If you just look at models before GPT-3, the trend line you’d draw is still noticeably steeper than the actual line on the graph. (ELMo and BERT-large are below trend, while T5 and Megatron 8.3B are above.) The new Megatron would represent the biggest trend-line undershoot.
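To put a rough number on that undershoot, here’s a quick log-linear fit over the usual pre-GPT-3 points. The sizes and dates are approximate public figures, and this is only a sanity check on the eyeballed trend line:

```python
import numpy as np

# Approximate (year, parameter count) pairs for pre-GPT-3 models; figures
# are from public announcements and may be slightly off.
models = {
    "ELMo":          (2018.1, 94e6),
    "BERT-large":    (2018.8, 340e6),
    "GPT-2":         (2019.1, 1.5e9),
    "Megatron 8.3B": (2019.6, 8.3e9),
    "T5":            (2019.8, 11e9),
}
years = np.array([y for y, _ in models.values()])
log10_params = np.log10([p for _, p in models.values()])

slope, intercept = np.polyfit(years, log10_params, 1)  # fit in log space
extrapolated = 10 ** (slope * 2021.8 + intercept)      # late 2021 (MT-NLG)
print(f"~{slope:.1f} OOMs/year; trend predicts ~{extrapolated:.1e} params")
print("Megatron-Turing actual: 5.3e+11")               # ~an OOM below trend
```

The fit comes out around 1.3 OOMs/year, which would put late 2021 in the trillions; 530B is roughly an order of magnitude short of that.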
Also, I think any post-COVID speedup will be more than drowned out by the recent slowdown in the rate at which compute prices fall. They were dropping by an order of magnitude (OOM) every 4 years, but now it’s every 10-16 years.
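Converting those figures into annual rates makes the difference vivid (the 4 vs. 10-16 year numbers are from this comment, not independently verified):

```python
# If prices drop 10x every `years_per_oom` years, the implied yearly
# decline rate r satisfies (1 - r) ** years_per_oom == 1/10.

def annual_decline(years_per_oom: float) -> float:
    return 1 - 10 ** (-1 / years_per_oom)

for years in (4, 10, 16):
    print(f"10x cheaper per {years:>2} yr -> {annual_decline(years):.0%} per year")
```

That’s roughly 44% cheaper per year on the old trend versus 13-21% now.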
We’re missing a GPT-4 data point for this chart. The new Megatron is more like one more replication of GPT-3, as is the new Chinese 245-billion-parameter model. But Google had a trillion-parameter model in fall 2020.
A better citation for the Chinese 245B model: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.
That’s new to me. Any citation for this?
https://arxiv.org/abs/2101.03961
It was not Google but Microsoft. In September 2020 they wrote: “The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048.” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
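As a sanity check, that config really is consistent with “trillion-parameter”: using the standard ~12·d² parameters per transformer layer (roughly 4d² for the attention projections plus 8d² for a 4x-expansion MLP, ignoring embeddings and biases):

```python
# Back-of-the-envelope parameter count for the quoted DeepSpeed config.
layers, d = 298, 17_408
approx_params = 12 * layers * d ** 2  # ~4d^2 attention + ~8d^2 MLP per layer
print(f"{approx_params:.2e}")         # ~1.08e+12 -- about a trillion
```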
That was just a tech demo. It obviously wasn’t actually trained to completion, just run for a few steps to show that the code works. If they had trained it back then, they’d, well, have announced it like OP! OP is what they trained for real after further fiddling with that code.
I’m not sure how relevant the slowdown in compute price decrease is to this chart, since the chart starts in 2018 and the slowdown started 6-8 years ago; likewise, AlexNet, the breakout moment for deep learning, was 9 years ago. So if compute price were the primary rate-limiter, I’d expect it to have a more gradual, consistent effect as models get bigger and bigger. The slowdown may mean that models cost quite a lot to train, but huge companies like Nvidia and Microsoft clearly haven’t yet shied away from spending absurd amounts of money to keep growing their models.