Yes, but they spent more money and created a much larger model than other groups, sooner than I’d otherwise have expected. It also crosses some threshold of “scarily good” for me, which surprises me.
My impression was that it followed existing trends pretty well, but I haven’t looked into it deeply.

From the paper, the scaling charts suggest that it wasn’t a discontinuity in terms of validation loss, which is (roughly speaking) the logarithm of perplexity.
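(For reference, and assuming the loss being plotted is the usual average per-token cross-entropy in nats, the relationship is

$$\text{perplexity} = e^{\text{loss}}, \qquad \text{loss} = \ln(\text{perplexity}),$$

so a smooth validation-loss curve implies an equally smooth perplexity curve.)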
Also, from the Wikipedia page:
GPT-3’s full version has a capacity of 175 billion [parameters] [...] Prior to the release of GPT-3, the largest language model was Microsoft’s Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3.
The year before, GPT-2 had 1.5 billion parameters and XLNet had 340M; the year before that, in 2018, BERT had 340M. Here are two charts from around that time:
It’s unclear whether there was a discontinuity roughly at the time of Nvidia’s Megatron, particularly on the logarithmic scale. GPT-3 was 10x the size of Microsoft’s last model, but came about 4 months afterwards, which seems like it might break that exponential.
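As a rough sanity check on that last point, here is a back-of-the-envelope sketch (my own, using only the figures quoted above, with release gaps treated as approximate) comparing the annualized growth rate implied by GPT-2 → Turing-NLG with the one implied by Turing-NLG → GPT-3:

```python
def annualized_factor(size_ratio: float, months: float) -> float:
    """Growth factor per year implied by `size_ratio` growth over `months` months."""
    return size_ratio ** (12 / months)

# GPT-2 (1.5B params) -> Turing-NLG (17B), roughly a year apart: the prior trend.
prior_trend = annualized_factor(17e9 / 1.5e9, 12)

# Turing-NLG (17B) -> GPT-3 (175B), roughly 4 months apart.
gpt3_jump = annualized_factor(175e9 / 17e9, 4)

print(f"Prior trend: ~{prior_trend:.0f}x per year")   # ~11x per year
print(f"GPT-3 jump:  ~{gpt3_jump:,.0f}x per year")    # ~1,100x per year
```

On those rough numbers, the GPT-3 step implies an annualized growth rate about two orders of magnitude above the preceding year’s trend.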