suggest that it wasn’t a discontinuity in terms of validation loss, which seems to be the logarithm of perplexity.
Also, from the Wikipedia page:
GPT-3’s full version has a capacity of 175 billion [parameters] [...] Prior to the release of GPT-3, the largest language model was Microsoft’s Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3.
The year before, GPT-2 had 1.5 billion parameters and XLNet had 340M. The year before that, in 2018, BERT had 340M. Here are two charts from around that time:
It’s unclear whether there was a discontinuity roughly at the time of Nvidia’s Megatron, particularly on a logarithmic scale. GPT-3 was 10x the size of Microsoft’s last model, but came only 4 months afterwards, which seems like it might break that exponential trend.
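One way to eyeball this is to annualize the growth rate between consecutive releases: if the Turing-NLG → GPT-3 step implies a much larger per-year factor than the preceding steps, that suggests a break from the prior exponential. A minimal sketch, using the parameter counts mentioned above; the release months are rough assumptions, not exact dates:

```python
from datetime import date

# Parameter counts from the discussion above; release months are approximate.
models = [
    ("BERT-large", date(2018, 10, 1), 0.34e9),
    ("GPT-2",      date(2019, 2, 1),  1.5e9),
    ("Turing-NLG", date(2020, 2, 1),  17e9),
    ("GPT-3",      date(2020, 5, 1),  175e9),
]

def annualized_growth(m1, m2):
    """Factor by which parameter count grows per year between two releases."""
    (_, d1, p1), (_, d2, p2) = m1, m2
    years = (d2 - d1).days / 365.25
    return (p2 / p1) ** (1 / years)

for a, b in zip(models, models[1:]):
    print(f"{a[0]} -> {b[0]}: {annualized_growth(a, b):.0f}x per year")
```

On these (assumed) dates, the Turing-NLG → GPT-3 step annualizes to a far larger factor than the steps before it, which is the sense in which a 10x jump in ~4 months "breaks" the exponential; of course, annualizing a single short interval is very sensitive to the exact dates used.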
My impression was that it followed existing trends pretty well, but I haven’t looked into it deeply.
From the paper, charts such as: