This seems to reflect a noticeable slowdown in the rate at which language model sizes increase. Compare the trend line you’d draw through the prior points to the one on the graph.
I’m still disappointed at the limited context window (2048 tokens). If you’re going to spend millions on training a transformer, you may as well make it one of the linear-time-complexity variants.
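For intuition on why that 2048 cap matters: vanilla softmax attention computes a score for every (query, key) pair, so cost blows up quadratically with context length, while the linear-attention variants (Performer-style kernel feature maps and the like) scale linearly. A toy sketch of the scaling, where `feature_dim` is just an illustrative placeholder, not any particular model’s setting:

```python
# Toy comparison of attention cost scaling; numbers are unitless "score
# computations", not a benchmark of any real implementation.

def full_attention_scores(seq_len: int, n_heads: int = 1) -> int:
    # Vanilla attention: one score per (query, key) pair per head.
    return n_heads * seq_len * seq_len

def linear_attention_cost(seq_len: int, feature_dim: int = 256) -> int:
    # Kernelized variants accumulate a fixed-size summary per position
    # instead of comparing all pairs: O(seq_len * feature_dim).
    return seq_len * feature_dim

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {full_attention_scores(n):>13,} quadratic"
          f" vs {linear_attention_cost(n):>10,} linear")
```

Even at 2048 tokens the quadratic term is already 8x the linear one (with this placeholder `feature_dim`), and the ratio grows with every added token.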
It looks like Turing NLG models are autoregressive generative, like GPTs. So, not good at things like rephrasing text sections based on bidirectional context, but good at unidirectional language generation. I’m confused as to why everyone is focusing on unidirectional models. It seems like, if you want to provide a distinct service compared to your competition, bidirectionality would be the way to go. Then your model would be much better at things like evaluating text content, rephrasing, or grammar checking. Maybe the researchers want to be able to compare results with prior work?
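At the architecture level, the unidirectional/bidirectional difference mostly comes down to the attention mask. A minimal numpy sketch (not any real model’s code):

```python
import numpy as np

# Causal (GPT-style) vs. bidirectional (BERT-style) attention masks.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len)))  # no peeking at future tokens
bidirectional_mask = np.ones((seq_len, seq_len))    # every token sees every other

print(causal_mask)
# Row i is nonzero only up to column i: position i conditions on the past
# alone, which is why rephrasing a span in light of the words *after* it
# is awkward for an autoregressive model.
```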
I’d hesitate to make predictions based on the slowdown of GPT-3 to Megatron-Turing, for two reasons.
First, GPT-3 represents the fastest, largest increase in model size in this whole chart. If you only look at the models before GPT-3, the drawn trend line tracks well. Note how far off the trend GPT-3 itself is.
Second, GPT-3 was released almost exactly when COVID became a serious concern in the world beyond China. I must imagine that this slowed down model development, but it will be less of a factor going forward.
If you just look at models before GPT-3, the trend line you’d draw is still noticeably steeper than the actual line on the graph. (ELMo and BERT-large are below trend, while T5 and Megatron 8.3B are above.) The new Megatron would represent the biggest trend-line undershoot.
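To put a rough number on that undershoot, here’s a quick log-linear fit over the usual pre-GPT-3 points. The sizes and dates are approximate public figures, and this is only a sanity check on the eyeballed trend line:

```python
import numpy as np

# Approximate (year, parameter count) pairs for pre-GPT-3 models; figures
# are from public announcements and may be slightly off.
models = {
    "ELMo":          (2018.1, 94e6),
    "BERT-large":    (2018.8, 340e6),
    "GPT-2":         (2019.1, 1.5e9),
    "Megatron 8.3B": (2019.6, 8.3e9),
    "T5":            (2019.8, 11e9),
}
years = np.array([y for y, _ in models.values()])
log10_params = np.log10([p for _, p in models.values()])

slope, intercept = np.polyfit(years, log10_params, 1)  # fit in log space
extrapolated = 10 ** (slope * 2021.8 + intercept)      # late 2021 (MT-NLG)
print(f"~{slope:.1f} OOMs/year; trend predicts ~{extrapolated:.1e} params")
print("Megatron-Turing actual: 5.3e+11")               # ~an OOM below trend
```

The fit comes out around 1.3 OOMs/year, which would put late 2021 in the trillions; 530B is roughly an order of magnitude short of that.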
Also, I think any post-COVID speedup will be more than drowned out by the recent slowdown in the rate at which compute prices fall. They were dropping by an order of magnitude (OOM) every 4 years, but now it’s every 10-16 years.
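Converting those figures into annual rates makes the difference vivid (the 4 vs. 10-16 year numbers are from this comment, not independently verified):

```python
# If prices drop 10x every `years_per_oom` years, the implied yearly
# decline rate r satisfies (1 - r) ** years_per_oom == 1/10.

def annual_decline(years_per_oom: float) -> float:
    return 1 - 10 ** (-1 / years_per_oom)

for years in (4, 10, 16):
    print(f"10x cheaper per {years:>2} yr -> {annual_decline(years):.0%} per year")
```

That’s roughly 44% cheaper per year on the old trend versus 13-21% now.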
We’re missing a GPT-4 data point for this chart. The new Megatron is more like one more replication of GPT-3, as is the new Chinese 245-billion-parameter model. But Google had a trillion-parameter model in fall 2020.
A better citation for the Chinese 245B model: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.
That’s new to me. Any citation for this?
https://arxiv.org/abs/2101.03961
It was not Google but Microsoft. In September 2020 they wrote: “The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048.” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
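As a sanity check, that config really is consistent with “trillion-parameter”: using the standard ~12·d² parameters per transformer layer (roughly 4d² for the attention projections plus 8d² for a 4x-expansion MLP, ignoring embeddings and biases):

```python
# Back-of-the-envelope parameter count for the quoted DeepSpeed config.
layers, d = 298, 17_408
approx_params = 12 * layers * d ** 2  # ~4d^2 attention + ~8d^2 MLP per layer
print(f"{approx_params:.2e}")         # ~1.08e+12 -- about a trillion
```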
That was just a tech demo. It obviously wasn’t actually trained to completion, just run for a few steps to show that the code works. If they had trained it back then, they’d, well, have announced it like OP! OP is what they trained for real after further fiddling with that code.
I’m not sure how relevant the slowdown in compute price decrease is to this chart, since the chart starts in 2018 and the slowdown started 6-8 years ago; likewise, AlexNet, the breakout moment for deep learning, was 9 years ago. So if compute price were the primary rate-limiter, I’d expect it to have a more gradual, consistent effect as models get bigger and bigger. The slowdown may mean that models cost quite a lot to train, but huge companies like Nvidia and Microsoft clearly haven’t yet shied away from spending absurd amounts of money to keep growing their models.