There is some point at which it’s gaining a given capability for the first time though, right? In earlier training stages I would expect the output to be gobbledygook, and then at some point it starts spelling out actual words. (I realize I’m conflating parameters and training compute, but I would expect a model with few enough parameters to output gobbledygook even when fully trained.)
So my read of the de-noising argument is that at current scaling margins we shouldn’t expect new capabilities—is that correct? Part of the evidence is that GPT-3 doesn’t show new capabilities over GPT-2. This also implies that capability gains are all front-loaded at lower parameter counts.
As an aside that people might find interesting: this recent paper shows OpenAI Codex succeeding on college math problems simply by prompting it in a particular way. So in this case the capability was there in GPT-3 all along…we just had to find it.
There is some point at which it’s gaining a given capability for the first time though, right? [...]
So my read of the de-noising argument is that at current scaling margins we shouldn’t expect new capabilities—is that correct?
Not quite.
If you define some capability in a binary yes-no way, where it either “has it” or “doesn’t have it”—then yes, there are models that “have it” and those that “don’t,” and there is some scale where models start “having it.”
But this apparent “switch flip” is almost always an artifact of the map, not a part of the territory.
Suppose we operationalize “having the capability” as “scoring better than chance on some test of the capability.” What we’ll find is that models smoothly move from doing no better than chance, to doing 1% better than chance, to doing 2% better . . . (numbers are meant qualitatively).
If we want, we can point at the model that does 1% better than chance and say “it got the capability right here,” but (A) this model doesn’t have the capability in any colloquial sense of the term, and (B) if we looked harder, we could probably find an intermediate model that does 0.5% better, or 0.25%...
(By the time the model does better than chance at something in a way that is noticeable to a human, it will typically have been undergoing a smooth, continuous increase in performance for a while already.)
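To make the thresholding point concrete, here’s a minimal sketch in Python with made-up numbers (the model names, accuracies, chance level, and cutoff are all hypothetical, not measurements from any real model). A score that rises smoothly with scale looks like a sudden “switch flip” once you collapse it into a binary has-it/doesn’t-have-it judgment:

```python
# A capability score that improves smoothly with scale appears to "emerge"
# suddenly once it is reported as a binary has-it / doesn't-have-it verdict.

CHANCE = 0.25  # e.g. chance accuracy on a 4-way multiple-choice test

# Hypothetical models at increasing scale, with smoothly increasing accuracy.
models = [
    ("1M params",   0.2500),  # indistinguishable from chance
    ("10M params",  0.2525),  # 0.25% above chance
    ("100M params", 0.2550),  # 0.5% above chance
    ("1B params",   0.2600),  # 1% above chance
    ("10B params",  0.3500),
    ("100B params", 0.6000),
]

THRESHOLD = 0.50  # an arbitrary cutoff for "has the capability"

for name, accuracy in models:
    above_chance = accuracy - CHANCE
    has_capability = accuracy >= THRESHOLD  # the binary "map"
    print(f"{name:>12}: {above_chance:+.2%} vs chance -> "
          f"{'has it' if has_capability else 'lacks it'}")

# The continuous column changes gradually at every step; only the binary
# column appears to flip, and where it flips depends entirely on THRESHOLD.
```

The point of the sketch is just that the apparent discontinuity lives in the choice of cutoff, not in the underlying curve: pick a different THRESHOLD and the “emergence” moves to a different scale.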