A few reasons:
1) Even the largest GPT-2 model was underparameterized.
2) Overparameterized = you can predict your training data perfectly. Our performance on memory tests shows that humans are nowhere close to that.
3) There’s a truly stupendous amount of data that we are exposed to; imagine predicting everything you have ever seen or heard.
4) The existing approaches I hear of for creating AGI sound like “let’s redo evolution”; if a lifetime of data didn’t already sound like a stupendously large amount, the amount of data used by evolution puts that to shame.
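To make the definition above concrete (overparameterized = you can predict your training data perfectly), here is a minimal sketch that is not from the thread. It uses polynomial fitting as a stand-in for a model: with as many coefficients as training points, the fit can reproduce every training label, while a two-parameter line generally cannot. The data and the function names `lagrange_predict` and `fit_line` are illustrative assumptions.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through n points at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                basis *= (x - xk) / (xj - xk)
        total += yj * basis
    return total


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Hypothetical training set: 4 points that do not lie on a line.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 0.0, 1.0]

# 4 coefficients for 4 points: enough capacity to hit every label.
poly_preds = [lagrange_predict(train_x, train_y, x) for x in train_x]
poly_mse = mean_squared_error(poly_preds, train_y)

# 2 coefficients for 4 points: training error cannot reach zero here.
slope, intercept = fit_line(train_x, train_y)
line_preds = [slope * x + intercept for x in train_x]
line_mse = mean_squared_error(line_preds, train_y)

print(f"4-parameter polynomial train MSE: {poly_mse:.3g}")
print(f"2-parameter line train MSE:       {line_mse:.3g}")
```

On this definition, the question for humans or GPT-2 is which of the two cases they resemble relative to the data they see, not how many parameters they have in absolute terms.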
Thanks!
On 2): Being overparameterized doesn’t mean you fit all your training data. It just means that you could fit it with enough optimization. Perhaps the existence of savants shows that the brain could memorize far more than it does.
On 3): The number of our synaptic weights is stupendous too: about 30,000 for every second of our lives.
On 4): You can underfit at the evolution level and still overparameterize at the individual level.
Overall you convinced me that underparameterization is less likely though. Especially on your definition of overparameterization, which is relevant for double descent.
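As a rough check on the figure of about 30,000 synapses per second of life, here is a back-of-the-envelope sketch. The synapse count (10^14, a commonly cited order-of-magnitude estimate) and the lifespans are assumptions added here, not numbers given in the thread; the point is only that the result lands in the tens of thousands.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 3.16e7


def synapses_per_second_of_life(synapse_count, lifespan_years):
    """Synapses available per second of lived experience."""
    return synapse_count / (lifespan_years * SECONDS_PER_YEAR)


ASSUMED_SYNAPSES = 1e14  # assumption: order-of-magnitude estimate only

for years in (80, 100):  # assumption: illustrative lifespans
    rate = synapses_per_second_of_life(ASSUMED_SYNAPSES, years)
    print(f"{years} years: ~{rate:,.0f} synapses per second of life")
```

Under these assumptions the ratio comes out between roughly 30,000 and 40,000, which is consistent with the figure quoted above.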
2) The “larger models are simpler” effect happens only after training to zero loss (at least if you’re using the double descent explanation for it, which is what I was thinking of).
3) Fair point; though note that for that you should also count up all the other things the brain has to do (e.g. motor control).
4) If “redoing evolution” produces AGI, I would expect that a mesa optimizer would “come from” the evolution level, not the individual level; so to the extent you want to argue “double descent implies simple large models implies mesa optimization”, you have to apply that argument to evolution.
(Probably you were asking about the question independently of the mesa optimization point; I do still hold this opinion more weakly for generic “AI systems of the future”; there the intuition comes from humans being underparameterized and from an intuition that AI systems of the future should be able to make use of more cheap, diverse / noisy data, e.g. YouTube.)
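One way to see why “simpler” is tied to training to zero loss is the linear least-squares analogue, sketched below as an illustration. This is a toy under stated assumptions, not the deep-learning setting under discussion. With two weights and one training point, many weight vectors fit the data perfectly; plain gradient descent started from zero converges to the one with the smallest norm, but only if it runs long enough to drive the training loss to zero. The data, learning rate, and step counts are hypothetical.

```python
import math

# One training example, two weights: more parameters than data points.
x = (1.0, 1.0)
y = 2.0


def loss(w):
    residual = w[0] * x[0] + w[1] * x[1] - y
    return residual ** 2


def gradient_descent(steps, lr=0.1):
    """Plain gradient descent on the squared error, starting from w = (0, 0)."""
    w = [0.0, 0.0]
    for _ in range(steps):
        residual = w[0] * x[0] + w[1] * x[1] - y
        # d/dw_i (residual^2) = 2 * residual * x_i
        w = [w[i] - lr * 2.0 * residual * x[i] for i in range(2)]
    return w


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


w_early = gradient_descent(steps=3)    # stopped early: training loss still nonzero
w_late = gradient_descent(steps=200)   # trained to (numerically) zero loss
w_other = [2.0, 0.0]                   # another perfect fit, with a larger norm

print("early stop:", w_early, "loss", loss(w_early))
print("converged: ", w_late, "loss", loss(w_late), "norm", norm(w_late))
print("other fit: ", w_other, "loss", loss(w_other), "norm", norm(w_other))
```

Both `w_late` and `w_other` fit the training point exactly, so which one you get is decided by the optimizer rather than by the data. The early-stopped run has the same capacity but has not used it.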
Sounds like we agree :)
To be clear, I broadly agree that AGI will be quite underparameterized, but still maintain that double descent demonstrates something—that larger models can do better by being simpler not just by fitting more data—that I think is still quite important.
It’s also not just pre-existing data, it’s partially manufactured. (The environment changes, and some of that change is a result of evolution (within and without species).) The distinction mattered in Go—AGZ just learned ‘from playing against itself’, while its predecessor looked at a dataset of human games.
I agree with that, but how does it matter for whether AI systems will be underparameterized?
Intuitively, if a domain is complicated enough that data must be manufactured to perform well, then it is complicated enough that AI systems in that domain will be underparameterized.
And “Data is created” sounds like a control problem.
But was AGZ’s predecessor overparameterized? I don’t know. The line between selection and control isn’t clear in either the problem or the solution (Go and AGZ, respectively).
If AGZ is better than anyone else at Go, is it overparameterized? It’s not labeling, it’s competing and making choices. The moves it makes might not be optimal (relative to having no resource constraints), and that optimum might never be achieved, so yes: underparameterized, forever. But it’s better than the trainers, so no: it’s overparameterized, since it’s making better choices than the dataset of human games, which it wasn’t even provided with.
A word’s been suggested before for when technology reaches the point that humans don’t add anything, but I don’t remember it. (This is true for 1v1 in chess, but with teams humans become useful again. I’d guess Go would be the same, but I haven’t heard about it.)
I’d say AGZ demonstrates Super performance—when the performer surpasses prior experts (and their datasets), and becomes the new expert (and source of data).