I think you are somewhat overly fixated on my claim that “maybe the AIs will accelerate the labor input R&D by 10x via basically just being fast and cheap junior employees”. My original claim (in the subcomment) is “I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap”. The “could” part is important. Correspondingly, I think this is only one part of the space of possibilities, though I do think it is a pretty plausible route. Additionally, banal does not imply simple/easy, and some level of labor quality will be needed.
(I did propose junior employees as an analogy which maybe implied simple/easy. I didn’t really intend this implication. I think the AIs have to be able to do at least somewhat hard tasks, but maybe don’t need to have a ton of context or have much taste if they can compensate with other advantages.)
I’ll argue against your comment, but first, I’d like to lay out a bunch of background to make sure we’re on the same page and to give a better understanding to people reading through.
Frontier LLM progress has historically been driven by 3 factors:
Increased spending on training runs ($)
Hardware progress (compute / $)
Algorithmic progress (intelligence / compute)
(The split seems to be very roughly 2⁄5, 1⁄5, 2⁄5 respectively.)
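A minimal sketch of one way to read that split — as shares of log effective compute, with the three factors combining multiplicatively; the overall growth number below is purely hypothetical:

```python
# Illustrative only: interpret the ~2/5, 1/5, 2/5 split as shares of log
# "effective compute" growth, with the three factors combining multiplicatively.
import math

total_growth = 10.0  # hypothetical overall effective-compute growth factor per year
shares = {"training spend": 2 / 5, "hardware": 1 / 5, "algorithms": 2 / 5}

factors = {name: total_growth ** s for name, s in shares.items()}
print(factors)                      # spend ~2.5x, hardware ~1.6x, algorithms ~2.5x
print(math.prod(factors.values()))  # ~10.0, recovering the total
```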
If we zoom into algorithmic progress, there are two relevant inputs to the production function:
Compute (for experiments)
Labor (from human researchers and engineers)
A reasonably common view is that compute is such a key bottleneck that even if you greatly improved labor, algorithmic progress wouldn’t go much faster. This seems plausible to me (though somewhat unlikely), but this isn’t what I was arguing about. I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
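A minimal sketch of the “compute bottleneck” view, using a CES production function with compute and labor as strong complements; all parameter values are illustrative assumptions, not estimates from this discussion:

```python
# Sketch of the "compute is the key bottleneck" view: algorithmic progress as a
# CES function of compute C and labor L with strong complementarity (rho << 0).
def progress(C, L, rho=-2.0, a=0.5, b=0.5):
    return (a * C**rho + b * L**rho) ** (1 / rho)

base = progress(C=1.0, L=1.0)
print(progress(C=1.0, L=10.0) / base)   # ~1.4x: 10x labor helps only a little if compute is fixed
print(progress(C=10.0, L=10.0) / base)  # 10x: scaling both inputs scales progress
```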
Now, I’ll try to respond to your claims.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute.
Maybe, but that isn’t exactly a crux in this discussion as noted above. The relevant question is whether the important labor going into ML experiments is more “insights” or “engineering” (not whether both of these are bottlenecked on compute).
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model.
My sense is that engineering is most of the labor, and most people I talk to with relevant experience have a view like: “taste is somewhat important, but lots of people have that, and fast execution is roughly as important or more important”. Notably, AI companies really want to hire fast and good engineers, and they seem to care about this roughly as much as about more traditional research scientist roles.
One relevant response would be “sure, AI companies want to hire good engineers, but weren’t we talking about the AIs being bad engineers who run fast?”
I think the AI engineers probably have to be quite good at moderate-horizon software engineering, but also that scaling up current approaches can pretty likely achieve this. Possibly my “junior hire” analogy was problematic, as “junior hire” can mean “not as good at programming” in addition to “not as much context at this company, but good at the general skills”.
So 10x’ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
I wasn’t saying that these AIs would mostly be 10x’ing the number of small-scale experiments, though I do think that increasing the number and serial speed of experiments is an important part of the picture.
There are lots of other things that engineers do (e.g., increase the efficiency of experiments so they use less compute, make it much easier to run experiments, etc.).
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
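Spelling out that BOTEC:

```python
# Reproducing the inference-cost BOTEC from the paragraph above.
copies = 50
output_tokens_per_sec = 70                # per copy, while generating
uptime = 1 / 4                            # fraction of time actually generating
seconds_per_year = 60 * 60 * 24 * 365
price_per_output_token = 15 / 1_000_000   # $15 per million output tokens

annual_cost = copies * output_tokens_per_sec * uptime * seconds_per_year * price_per_output_token
print(f"${annual_cost:,.0f}")  # ~$414,000, i.e. roughly $400k/year
```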
Yes, this compute comes directly at the cost of experiments, but so do employee salaries at current margins. (Maybe this will be less true in the future.)
At the point when AIs are first capable of doing the relevant tasks, they will probably be pretty expensive to run, but I expect costs to drop pretty quickly. And AI companies will have far more compute in the future, as compute is scaling up at a rapid rate, making the plausible number of instances substantially higher.
Is there a reason to think that any need for that couldn’t already be satisfied? If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers.
I think AI companies would be very happy to hire lots of software engineers who work for nearly free, run 10x faster, work 24⁄7, and are pretty good research engineers. This seems especially true if you add other structural advantages of AI into the mix (train once and use many times, fewer personnel issues, easy to scale up and down, etc). The serial speed is very important.
(The bar of “competent-ish” seems too low. Again, I think “junior” might have been leading you astray here, sorry about that. Imagine more like median AI company engineering hire or a bit better than this. My original comment said “automating research engineering”.)
LLM-based coding tools seem competent enough to significantly speed up a human programmer’s work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
I’m not sure I buy this claim about current tools. Also, I wasn’t making a claim about AIs just doing simple tasks (banal does not mean simple) as discussed earlier.
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
Maybe a relevant crux is: “Could scaling up current methods yield AIs that can mostly autonomously automate software engineering tasks that are currently being done by engineers at AI companies?” (More precisely, succeed at these tasks very reliably with only a small amount of human advice/help amortized over all tasks. Probably this would partially work by having humans or AIs decompose into relatively smaller subtasks that require a bit less context, though this isn’t notably different from how humans do things themselves.)
But, I think you maybe also have a further crux like: “Does making software engineering at AI companies cheap and extremely fast greatly accelerate the labor input to AI R&D?”
Yup, those two do seem to be the cruxes here.

I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
(1) “Banal” software engineering.
(2) Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
(3) Coming up with new ideas regarding how AI capabilities can be progressed.
(4) High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.
Which, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
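A toy Amdahl’s-law-style version of this claim; the fraction of labor that can be accelerated is an illustrative assumption, not a measurement:

```python
# Amdahl's-law-style sketch: if only a fraction p of the AI R&D labor (engineering,
# experiment plumbing) is accelerated 10x while the rest (the senior researchers'
# serial thinking, i.e. (a) above) stays at 1x, the overall labor speedup is capped.
def overall_speedup(p, factor=10):
    return 1 / ((1 - p) + p / factor)

for p in (0.5, 0.7, 0.9):
    print(p, round(overall_speedup(p), 2))  # 0.5 -> 1.82x, 0.7 -> 2.7x, 0.9 -> 5.26x
```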
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see whether o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.

[2] Arguably, no improvement since GPT-2; I think that post aged really well.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that didn’t yet noticeably move from zero, operationalizing intuitions of lack of progress. Performance on such benchmarks still has room to change significantly even with pretraining scaling in the near future, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
If you were to spend equal amounts of money on LLM inference and on GPUs for experiments, that would mean that you’re spending $400,000 / year on experiment GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
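Spelling out that arithmetic:

```python
# Reproducing the per-instance GPU-budget arithmetic above.
inference_spend = 400_000           # $/year on LLM inference (from the BOTEC quoted above)
gpu_spend = 400_000                 # assumed equal $/year on experiment GPUs
copies = 50
waiting_hours_per_day = 18          # 1/4 uptime -> ~18 hours/day waiting on experiments

per_copy_budget = gpu_spend / copies                              # $8,000/year per instance
hourly_budget = per_copy_budget / (waiting_hours_per_day * 365)   # ~$1.22/hour
print(per_copy_budget, round(hourly_budget, 2))
```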
So I guess the crux is “would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent”.
(and yes, at some point the question switches to “would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans”, but at that point it’s a matter of quality, not quantity)
Sure, but I think that at the relevant point, you’ll probably be spending at least 5x more on experiments than on inference, and potentially a much larger ratio if heavy test-time compute usage isn’t important. I was just trying to argue that the naive inference cost isn’t that crazy.
Notably, if you give each researcher 2k GPUs, that would be $2 / GPU-hour * 2,000 GPUs * 24 * 365 hours = $35,040,000 per year, which is much higher than the inference cost of the models!
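Spelling out that comparison against the ~$400k/year inference BOTEC from earlier:

```python
# Experiment compute per researcher vs. the inference cost of the AI "staff".
gpu_hour_price = 2                   # $/GPU-hour, the rough rental price used above
gpus_per_researcher = 2_000
hours_per_year = 24 * 365

experiment_cost = gpu_hour_price * gpus_per_researcher * hours_per_year
print(f"${experiment_cost:,}")           # $35,040,000/year per researcher
print(round(experiment_cost / 400_000))  # ~88x the ~$400k/year inference estimate
```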