I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster.
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
(1) “Banal” software engineering.
(2) Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
(3) Coming up with new ideas regarding how AI capabilities can be progressed.
(4) High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), and (3) can only speed up (4) by increasing the turnaround speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of the ideas in (3) and cannot directly speed up or replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and optimization of significant experiments.
Those, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can speed things up significantly, say 2x. But not 10x.)
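To make the arithmetic behind that concrete, here is a minimal Amdahl's-law-style sketch; the 50/50 split between serial strategic thinking and everything else is a made-up illustrative assumption, not a number anyone in this discussion has claimed:

```python
def overall_speedup(serial_fraction: float, accelerated_speedup: float) -> float:
    """Amdahl's-law-style estimate of overall research speedup.

    serial_fraction: share of progress gated by (a), the senior researchers'
        serial thinking, which is assumed not to be accelerated at all.
    accelerated_speedup: speedup applied to everything else -- the engineering
        and experiment-optimization work in (1)-(3).
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / accelerated_speedup)

# Made-up 50/50 split between the serial bottleneck and the accelerable work:
print(overall_speedup(0.5, 10))    # ~1.8x overall from a 10x on engineering
print(overall_speedup(0.5, 1e9))   # ~2.0x -- the ceiling if (a) never speeds up
```

Under that hypothetical split, even an arbitrarily large speedup on the engineering side tops out at about 2x overall, which is the shape of the claim above.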
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see if o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1: that just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.

[2] Arguably, no improvement since GPT-2; I think that post aged really well.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that haven’t yet noticeably moved from zero, operationalizing the intuition of a lack of progress. Performance on such benchmarks still has room to change significantly even with near-term pretraining scaling alone, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
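Purely as an illustration of that last point (the logistic shape, midpoint, and slope below are made-up assumptions, not anything claimed in the comment), an “anti-saturated” benchmark could look like a curve that sits near zero at today’s compute but has plenty of headroom at 500x more:

```python
import math

def hypothetical_score(flops: float, midpoint_flops: float = 1e28, slope: float = 2.0) -> float:
    """Illustrative benchmark score in [0, 1], modeled as a logistic function of
    log10(pretraining compute). The midpoint and slope are arbitrary choices
    that make the benchmark look 'anti-saturated' at current scale."""
    x = math.log10(flops) - math.log10(midpoint_flops)
    return 1.0 / (1.0 + math.exp(-slope * x))

current, by_2028 = 1e26, 5e28
print(by_2028 / current)                      # ~500x more compute
print(round(hypothetical_score(current), 3))  # ~0.018 -- barely off zero today
print(round(hypothetical_score(by_2028), 3))  # ~0.802 -- lots of room to move by 2028
```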
Yup, those two do seem to be the cruxes here.