Also worth noting: the model (Gopher) was trained in December 2020, a year ago. I don’t know when GPT-3 was trained, but if the time gap between the two is small, that sure looks like a substantial discontinuity in training efficiency. (Though I’d prefer to see long-run data.)
If two people trained language models at the same time and one was better than the other, would you call it infinitely fast progress?
I’m confused what you’re asking.
The observation that two SOTA language models trained close together in time differed substantially in measured performance provides evidence of a discontinuity in the usual sense: a large residual from the prior extrapolation.
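To make “a large residual from prior extrapolation” concrete, here is a minimal sketch of how such a residual could be computed. Everything in it is a made-up placeholder (hypothetical dates and scores, an assumed linear trend), not actual benchmark data from either model:

```python
# Minimal sketch of "discontinuity as a large residual from prior extrapolation".
# All dates and scores below are hypothetical placeholders, not real benchmark data.
import numpy as np

# Prior SOTA results: (years since an arbitrary reference date, benchmark score),
# improving at a steady ~3 points per year.
prior_t = np.array([0.0, 1.0, 2.0, 3.0])
prior_score = np.array([55.0, 58.0, 61.0, 64.0])

# Fit a linear trend to the prior SOTA history.
slope, intercept = np.polyfit(prior_t, prior_score, 1)

# A new model appears at t = 3.5 scoring 75: compare it to the extrapolated trend.
t_new, score_new = 3.5, 75.0
predicted = slope * t_new + intercept
residual = score_new - predicted

# Express the residual as "years of progress at the prior rate".
print(f"residual = {residual:.1f} points, i.e. ~{residual / slope:.1f} years ahead of trend")
```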
I can answer your question literally: I don’t think that would be infinitely fast progress. I am genuinely unsure what your point is though. :)
I think there’s a significant point here: that it only makes sense to compare with the expected trend rather than with one data point.
In particular, note that if Gopher had been released one day before GPT-3, then GPT-3 wouldn’t have been SOTA, and the time-to-achieve-x-progress would look a lot longer.
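As a rough illustration of that point (again with made-up numbers and hypothetical helper functions): measuring progress as improvement-over-the-last-SOTA divided by time-since-the-last-SOTA blows up as that release gap shrinks, which is the “infinitely fast progress” failure mode above, whereas the residual from the extrapolated trend doesn’t care how recently the previous SOTA happened to be released.

```python
# Hypothetical numbers only. Assumed prior trend: score ≈ 55 + 3 * t (3 points per year).
slope, intercept = 3.0, 55.0

def rate_vs_last_point(t_prev, score_prev, t_new, score_new):
    """Progress per year measured against the single most recent SOTA point.
    Diverges as the release gap t_new - t_prev shrinks toward zero."""
    return (score_new - score_prev) / (t_new - t_prev)

def residual_vs_trend(t_new, score_new):
    """Progress measured as the residual from the extrapolated prior trend.
    Unaffected by how recently the previous SOTA was released."""
    return score_new - (slope * t_new + intercept)

t_new, score_new = 3.5, 75.0  # the same new result in both scenarios

# Previous SOTA released six months earlier vs. one day earlier:
print(rate_vs_last_point(3.0, 64.0, t_new, score_new))          # 22 points/year
print(rate_vs_last_point(3.5 - 1/365, 64.0, t_new, score_new))  # ~4015 points/year
print(residual_vs_trend(t_new, score_new))                      # 9.5 points either way
```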
(FWIW, it still seems like a discontinuity to me)
GPT-3 appeared on arXiv in May 2020: https://arxiv.org/abs/2005.14165
Though I don’t know exactly when it was trained.
It was trained on internet data from October 2019, so it must have been trained between October 2019 and May 2020 (when the paper appeared).