suggest that it wasn’t a discontinuity in terms of validation loss, which seems to be the logarithm of perplexity.
Also, from the Wikipedia page:
GPT-3’s full version has a capacity of 175 billion [parameters] [...] Prior to the release of GPT-3, the largest language model was Microsoft’s Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3.
The year before, GPT-2 had 1.5 billion parameters and XLNet had 340M. The year before that, in 2018, BERT had 340M. Here are two charts from around that time:
It’s unclear whether there was a discontinuity roughly at the time of Nvidia’s Megatron, particularly on a logarithmic scale. GPT-3 was 10x the size of Microsoft’s last model, but came only 4 months afterwards, which seems like it might break that exponential trend.
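One way to eyeball this is to annualize the growth rate between consecutive releases: if the Turing-NLG → GPT-3 step implies a much larger per-year factor than the preceding steps, that suggests a break from the prior exponential. A minimal sketch, using the parameter counts mentioned above; the release months are rough assumptions, not exact dates:

```python
from datetime import date

# Parameter counts from the discussion above; release months are approximate.
models = [
    ("BERT-large", date(2018, 10, 1), 0.34e9),
    ("GPT-2",      date(2019, 2, 1),  1.5e9),
    ("Turing-NLG", date(2020, 2, 1),  17e9),
    ("GPT-3",      date(2020, 5, 1),  175e9),
]

def annualized_growth(m1, m2):
    """Factor by which parameter count grows per year between two releases."""
    (_, d1, p1), (_, d2, p2) = m1, m2
    years = (d2 - d1).days / 365.25
    return (p2 / p1) ** (1 / years)

for a, b in zip(models, models[1:]):
    print(f"{a[0]} -> {b[0]}: {annualized_growth(a, b):.0f}x per year")
```

On these (assumed) dates, the Turing-NLG → GPT-3 step annualizes to a far larger factor than the steps before it, which is the sense in which a 10x jump in ~4 months "breaks" the exponential; of course, annualizing a single short interval is very sensitive to the exact dates used.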
My impression was that it followed existing trends pretty well, but I haven’t looked into it deeply.
From the paper, charts such as: