There are other reasons why top mathematicians could have better output than average mathematicians: they could be working on more salient problems; there's selection bias in who we call a "top mathematician"; they could be situated in an intellectual microcosm more suitable for mathematical progress; and so on.
Do you really think these things contribute much to a factor of a thousand? Roughly speaking, what I’m talking about here is how much longer it would take for an average mathematician to reproduce the works of Terry Tao (assuming the same prior information as Terry had before figuring out the things he figured out, of course).
However, those log(n) bits of optimization pressure are being directly applied towards that goal, and it’s not easy to have a learning process that applies optimization pressure in a similarly direct manner (as opposed to optimizing for something like “ability to do well on this math problem dataset”).
I think Terry Tao would do noticeably better on a math problem dataset than most other mathematicians! This is where it's important to note that "optimization in vs. optimization out" is not actually a single "steepness" parameter, but the shape of a curve. If the thing you're optimizing doesn't already have the rough shape of an optimizer, then maybe you aren't really managing to do much meta-optimization. In other words, the scaling might not be very steep because, as you said, it's hard to figure out exactly how to direct "dumb" (i.e. SGD) optimization pressure.
But suppose you’ve trained an absolutely massive model that’s managed to stumble onto the “rough shape of an optimizer” and is now roughly human-level. It seems obvious to me that you don’t need to push on this thing very hard to get what we would recognize as massive performance increases for the reason above: it’s not very hard to pick out a Terry Tao from the Earth’s supply of mathematicians, even by dumb optimization on a pretty simple metric (such as performance on some math dataset).
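To make the "dumb optimization on a simple metric" point concrete: selecting the single best of n candidates applies roughly log2(n) bits of optimization pressure toward whatever the selection metric measures, which is cheap relative to the performance gap it can surface. A minimal sketch (the population size and the Gaussian "benchmark score" model are illustrative assumptions, not claims about real mathematicians):

```python
import math
import random

def bits_of_selection(n: int) -> float:
    """Picking the single best of n candidates applies ~log2(n)
    bits of optimization pressure toward the selection metric."""
    return math.log2(n)

# Hypothetical illustration: "pick out a Terry Tao" from a pool of
# ~100,000 mathematicians by scoring each one on some benchmark.
# Scores are modeled as standard normal draws purely for illustration.
random.seed(0)
population = [random.gauss(0, 1) for _ in range(100_000)]
best = max(population)

print(f"bits of selection applied: {bits_of_selection(len(population)):.1f}")
print(f"best candidate sits {best:.1f} sigma above the mean")
```

The point of the sketch is the asymmetry: ~17 bits of selection is a trivial amount of optimization pressure to apply, yet it reliably surfaces a multi-sigma outlier on the metric, so even crude selection over near-human-level systems could yield what looks like a large capability jump.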
Finally, AI Impacts has done a number of investigations into how long it took AI systems to go from roughly human-level to better-than-human performance in different domains. For example, it took about 10 years for diagnosis of diabetic retinopathy. I think this line of research is more directly informative on this question.
I don't see this as very informative about how optimizers scale as you apply meta-optimization. If the thing you're optimizing is not really itself an optimizer (e.g., a narrow-domain tool), then what you're measuring is more akin to the total amount of optimization you've put into it, rather than the strength of the optimizer you've produced by applying meta-optimization.