Also worth noting: the model (Gopher) was trained in December 2020, a year ago. I don’t know when GPT-3 was trained, but if the time gap between the two is small, that sure looks like a substantial discontinuity in training efficiency. (Though I’d prefer to see long-run data.)
If two people trained language models at the same time and one was better than the other, would you call it infinitely fast progress?
I’m confused what you’re asking.
The observation that two SOTA language models trained close together in time differed substantially in measured performance provides evidence of a discontinuity in the usual sense: a large residual from the prior extrapolation.
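To make “a large residual from prior extrapolation” concrete, here is a minimal sketch of how such a residual could be computed. Everything in it is a made-up placeholder (hypothetical dates and scores, an assumed linear trend), not actual benchmark data from either model:

```python
# Minimal sketch of "discontinuity as a large residual from prior extrapolation".
# All dates and scores below are hypothetical placeholders, not real benchmark data.
import numpy as np

# Prior SOTA results: (years since an arbitrary reference date, benchmark score),
# improving at a steady ~3 points per year.
prior_t = np.array([0.0, 1.0, 2.0, 3.0])
prior_score = np.array([55.0, 58.0, 61.0, 64.0])

# Fit a linear trend to the prior SOTA history.
slope, intercept = np.polyfit(prior_t, prior_score, 1)

# A new model appears at t = 3.5 scoring 75: compare it to the extrapolated trend.
t_new, score_new = 3.5, 75.0
predicted = slope * t_new + intercept
residual = score_new - predicted

# Express the residual as "years of progress at the prior rate".
print(f"residual = {residual:.1f} points, i.e. ~{residual / slope:.1f} years ahead of trend")
```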
I can answer your question literally: I don’t think that would be infinitely fast progress. I am genuinely unsure what your point is though. :)
I think there’s a significant point here: that it only makes sense to compare with the expected trend rather than with one data point.
In particular, note that if Gopher had been released one day before GPT-3, then GPT-3 wouldn’t have been SOTA, and the time-to-achieve-x-progress would look a lot longer.
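As a rough illustration of that point (again with made-up numbers and hypothetical helper functions): measuring progress as improvement-over-the-last-SOTA divided by time-since-the-last-SOTA blows up as that release gap shrinks, which is the “infinitely fast progress” failure mode above, whereas the residual from the extrapolated trend doesn’t care how recently the previous SOTA happened to be released.

```python
# Hypothetical numbers only. Assumed prior trend: score ≈ 55 + 3 * t (3 points per year).
slope, intercept = 3.0, 55.0

def rate_vs_last_point(t_prev, score_prev, t_new, score_new):
    """Progress per year measured against the single most recent SOTA point.
    Diverges as the release gap t_new - t_prev shrinks toward zero."""
    return (score_new - score_prev) / (t_new - t_prev)

def residual_vs_trend(t_new, score_new):
    """Progress measured as the residual from the extrapolated prior trend.
    Unaffected by how recently the previous SOTA was released."""
    return score_new - (slope * t_new + intercept)

t_new, score_new = 3.5, 75.0  # the same new result in both scenarios

# Previous SOTA released six months earlier vs. one day earlier:
print(rate_vs_last_point(3.0, 64.0, t_new, score_new))          # 22 points/year
print(rate_vs_last_point(3.5 - 1/365, 64.0, t_new, score_new))  # ~4015 points/year
print(residual_vs_trend(t_new, score_new))                      # 9.5 points either way
```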
(FWIW, it still seems like a discontinuity to me)
GPT-3 appeared on arXiv in May 2020: https://arxiv.org/abs/2005.14165
Though I don’t know exactly when it was trained.
It was trained on internet data from October 2019, so it must have been trained between October 2019 and May 2020 (when the paper appeared).