I think the methods you describe for gaining data suffice to get more data in <1 year than a human expert sees in a lifetime
Which is why I don’t expect big delays from this
But I agree this stuff will be a bottleneck
For the intelligence explosion I think all we need is experience with AI R&D?
There is then a further question, if that IE goes very far, of whether AI will generalize to chemistry etc
I’d have thought “yes” because you achieve somewhat superhuman sample efficiency and can quickly get as much chemistry experience as a human expert
Yeah agreed—AIs will improve in sample efficiency and will generate experience in all aspects of AI R&D in parallel, but they’ll be weakest in the areas where they have the least experience
Although if most algorithmic breakthroughs provide bigger benefits at higher compute scales, then that suggests that the pace of algorithmic progress has only been sustained by moving to bigger compute scales.
Then when compute is held constant, we’ll face much steeper diminishing marginal returns.
Might be possible to estimate the size of this effect quantitatively by looking at how much smaller the gains are at lower compute scales, and how quickly we’ve scaled up compute.
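To gesture at how that estimate might work, here’s a toy sketch. Every number in it (the 4X/year compute scale-up, the 3X/year measured algorithmic progress, and the exponent gamma for how much weaker gains are at smaller scale) is made up for illustration, not a real estimate.

```python
# Toy BOTEC; all numbers are assumed for illustration, not real estimates.
# Suppose the benefit of an algorithmic advance scales as (compute scale)**gamma,
# so part of the progress measured at the growing frontier was really "unlocked"
# by moving to bigger compute scales.
compute_growth_per_year = 4.0    # assumed historical compute scale-up per year
measured_alg_progress   = 3.0    # assumed measured algorithmic progress (x effective compute per year)
gamma = 0.3                      # assumed sensitivity of gains to compute scale

scale_unlocked = compute_growth_per_year ** gamma          # ~1.5x/yr attributable to scaling up
fixed_compute_progress = measured_alg_progress / scale_unlocked

print(f"progress unlocked by compute scale-up: {scale_unlocked:.2f}x/yr")
print(f"implied progress at fixed compute:     {fixed_compute_progress:.2f}x/yr")
```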
This is an additional point on top of being bottlenecked on compute for experiments.
Hmm, I’m not sure I buy the analogy here. Can’t people just run parametric experiments at smaller scale? E.g., search over a really big space, do evolution style stuff, etc?
Yeah agree parametric/evolution stuff changes things.
But if you couldn’t do that stuff, do you agree cognitive labour would plausibly have been a hard bottleneck?
If so, that does seem analogous to the case where we scale up cognitive labour by 3 OOMs. After all, I’m not sure what the analogue of “parametric experiments” is when you have abundant cognitive labour and limited compute.
This view would imply that experiments at substantially smaller (but absolutely large) scale don’t generalize up to a higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scales, which seems a bit implausible to me.
Agree this is an implication. (It’s an implication of any view where compute can be a hard bottleneck—past a certain point you learn 10X less info by running an experiment at a 10X smaller scale.)
But why implausible? Could we have developed RLHF, prompting, tool-use, and reasoning models via loads of experiments on GPT-2 scale models? Does make sense to me that those models just aren’t smart enough to learn any of this and your experiments have 0 signal.
An alternative option is to just reduce the frontier scale with AIs
Yeah I think this is a plausible strategy. If you can make 100X faster progress at the 10^26 scale than the 10^27 scale, why not do it.
Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address.
Well, unfortunately the people actively defending the view that compute will be a bottleneck haven’t been specific about what they think the functional form is. They’ve just said vague things like “compute for experiments is a bottleneck”. In that post I initially gave the simplest model for concretising that claim, and you followed suit in this post when talking about “7 OOMs”, but I don’t think anyone’s said that model represents their view rather than the ‘near frontier experiments’ model.
Yeah, you can get into other fancy tricks to defend it like:
Input-specific technological progress. Even if labour has grown more slowly than capital, maybe the ‘effective labour supply’ (which includes tech that makes labour more productive, e.g. drinking caffeine, writing faster on a laptop) has grown as fast as capital.
Input-specific ‘stepping on toes’ adjustments. If capital grows at 10%/year and labour grows at 5%/year, but (effective capital) = capital^0.5 and (effective labour) = labour, then the growth rates of effective capital and effective labour are roughly equal (both ~5%/year).
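As a quick sanity check on that second trick (with the concave adjustment applied to capital, and using the illustrative 10%/5% growth rates above rather than real data):

```python
# Sanity check on the 'stepping on toes' trick (illustrative growth rates, not real data)
capital_growth, labour_growth = 0.10, 0.05   # raw capital grows 10%/yr, raw labour 5%/yr
years = 20

capital = (1 + capital_growth) ** years       # ~6.7x
labour  = (1 + labour_growth) ** years        # ~2.7x

effective_capital = capital ** 0.5            # stepping-on-toes adjustment applied to capital
effective_labour  = labour                    # labour taken at face value

print(f"effective capital grew {effective_capital:.2f}x")   # ~2.59x
print(f"effective labour grew  {effective_labour:.2f}x")    # ~2.65x
# Roughly equal effective growth, so neither effective input falls far behind the other.
```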
My sense is that labs have scaled up frontier training runs faster than they’ve scaled up their supply of compute.
I.e. in 2015 the biggest training runs would have been <<10% of OAI’s compute, but that’s no longer true today.
Not confident in this!
Thanks, this is a great comment.
I buy your core argument against the main CES model I presented. I think your key argument (“the relative quantity of labour vs compute has varied by OOMs; if CES were true, then one input would have become a hard bottleneck; but it hasn’t”) is pretty compelling as an objection to the simple naive CES I mention in the post. It updates me even further towards thinking that, if you use this naive CES, you should have ρ > −0.2. Thanks!
The core argument is less powerful against a more realistic CES model that replaces ‘compute’ with ‘near-frontier-sized experiments’. I’m less sure how strong it is as an argument against the more-plausible version of the CES where, rather than inputs of cognitive labour and compute, we have inputs of cognitive labour and number of near-frontier-sized experiments. (I discuss this briefly in the post.) I.e. if a lab has total compute C_t, its frontier training run takes C_f compute, and we say that a ‘near-frontier-sized’ experiment uses 1% as much compute as training a frontier model, then the number of near-frontier-sized experiments that the lab could run equals E = 100 * C_t / C_f.
With this formulation, it’s no longer true that a key input has increased by many OOMs, which was the core of your objection (at least the part of your objection that was about the actual world rather than about hypotheticals; I discuss hypotheticals below).
Unlike compute, E hasn’t grown by many OOMs over the last decade. How has it changed? I’d guess it’s gotten a bit smaller over the past decade as labs have scaled frontier training runs faster than they’ve scaled their total quantity of compute for running experiments. But maybe labs have scaled both at equal pace, especially as in recent years the size of pre-training has been growing more slowly (excluding GPT-4.5).
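To make that concrete, here’s a minimal sketch with made-up compute numbers showing why E barely moves even when raw compute grows by OOMs:

```python
# Illustration with made-up compute numbers: if frontier training runs scale
# roughly as fast as a lab's total compute, E = 100 * C_t / C_f barely changes
# even though raw compute grows by OOMs.

def near_frontier_experiments(C_t, C_f):
    """Experiments costing 1% of a frontier training run that the lab can afford."""
    return 100 * C_t / C_f

early = near_frontier_experiments(C_t=1e24, C_f=1e23)   # hypothetical lab, ~a decade ago (FLOP)
late  = near_frontier_experiments(C_t=1e27, C_f=1e26)   # the same hypothetical lab today

print(early, late)   # 1000.0 1000.0 -- total compute grew 3 OOMs, E didn't grow at all
# If C_f scaled a bit faster than C_t, E would have shrunk somewhat, as I guess above.
```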
So this version of the CES hypothesis fares better against your objection, because the relative quantity of the two inputs (cognitive labour and number of near-frontier experiments) has changed by less over the past decade. Cognitive labour inputs have grown by maybe 2 OOMs over the past decade, but the ‘effective labour supply’, adjusting for diminishing quality and stepping-on-toes effects, has grown by maybe just 1 OOM. With just a 1 OOM relative increase in cognitive labour, the CES function with ρ = −0.4 implies that compute will have become more of a bottleneck, but not a complete bottleneck such that more labour isn’t still useful. And that seems roughly realistic.
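Here’s a minimal numerical sketch of that last claim, assuming α = 0.5, ρ = −0.4, and an arbitrary starting point where both inputs equal 1: after labour rises 1 OOM relative to experiments, compute’s marginal product is roughly 25x labour’s, but labour’s is still positive.

```python
# CES sketch: progress = (0.5 * L**rho + 0.5 * E**rho) ** (1 / rho), rho = -0.4.
# Start from L = E = 1 (arbitrary units), then give labour a 1 OOM relative increase.

def ces(L, E, rho=-0.4, a=0.5):
    # CES combination of cognitive labour L and near-frontier experiments E
    return (a * L**rho + (1 - a) * E**rho) ** (1 / rho)

L, E = 10.0, 1.0        # labour up 1 OOM relative to experiments
eps = 1e-6              # numerical marginal products via finite differences
mp_labour  = (ces(L + eps, E) - ces(L, E)) / eps
mp_compute = (ces(L, E + eps) - ces(L, E)) / eps

print(f"output: {ces(L, E):.2f}")                        # ~2.45x the starting level
print(f"marginal product of labour:  {mp_labour:.3f}")   # ~0.07: small but still positive
print(f"marginal product of compute: {mp_compute:.3f}")  # ~1.75: ~25x labour's, so compute is the tighter constraint
```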
Minimally, the CES view predicts that AI companies should be spending less and less of their budget on compute as GPUs are getting cheaper. (Which seems very false.)
This version of the CES hypothesis also dodges this objection. AI companies need to spend much more on compute over time just to keep E constant and avoid compute becoming a bottleneck.
This model does make a somewhat crazy prediction where it implies that if you scale up compute and labor exactly in parallel, eventually further labor has no value. (I suppose this could be true, but seems a bit wild.)
Doesn’t seem that wild to me? When we scale up compute we’re also scaling up the size of frontier training runs; maybe past a certain point running smaller experiments just isn’t useful (e.g. you can’t learn anything from experiments using 1 billionth of the compute of a frontier training run); and maybe past a certain point you just can’t design better experiments. (Though I agree with you that this is all unlikely to bite before a 10X speed up.)
consider a hypothetical AI company with the same resources as OpenAI except that they only employ aliens whose brains work 10x slower and for which the best researcher is roughly as good as OpenAI’s median technical employee
Nice thought experiment.
So the near-frontier-experiment version of the CES hypothesis would say that those aliens would be in a world where experimental compute isn’t a bottleneck on AI progress at all: the aliens don’t have time to write the code to run the experiments they have the compute for! And we know we’re not in that world, because experiments are a real bottleneck on our pace of progress already: researchers say they want more compute! These hypothetical aliens would make no such requests. It may be a weird empirical coincidence that cognitive labour helps up to our current level but not that much further, but we can confirm it with evidence from the marginal value of compute in our world.
But actually I do agree the CES hypothesis is pretty implausible here. More compute seems like it would still be helpful for these aliens: e.g. automated search over different architectures and running all experiments at large scale. And evolution is an example where the “cognitive labour” going into AI R&D was very very very minimal and still having lots of compute to just try stuff out helped.
So I think this alien hypothetical is probably the strongest argument against the near-frontier experiment version of the CES hypothesis. I don’t think it’s devastating—the CES-advocate can bite the bullet and claim that more compute wouldn’t be at all useful in that alien world.
(Fwiw I preferred the way you described that hypothesis before your last edit.)
You can also try to ‘block’ the idea of a 10X speed up by positing a large ‘stepping on toes’ effect. If it’s very important to do experiments in series, and experiments can’t be sped up past a certain point, then experiments could still bottleneck progress. This wouldn’t be about the quantity of compute being a bottleneck per se, so it avoids your objection. Instead the bottleneck is ‘number of experiments you can run per day’. Mathematically, you could represent this by something like:
AI progress per week = log(1000 + L^0.5 * E^0.5)
The idea is that there are ~linear gains to research effort initially, but past a certain point returns start to diminish increasingly steeply such that you’d struggle to ever realistically 10X the pace of progress.
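To illustrate, using the same made-up constant of 1000 and an arbitrary fixed E: going from a tiny workforce to a today-ish one gives a big speed-up, but a further million-fold increase in labour buys less than a 2X further speed-up.

```python
import math

def progress_per_week(L, E):
    # Same made-up constant (1000) and exponents as the formula above
    return math.log(1000 + L**0.5 * E**0.5)

E = 1e6   # arbitrary fixed experiment capacity

print(progress_per_week(1,    E))   # ~7.6  : tiny "slow aliens" workforce
print(progress_per_week(1e4,  E))   # ~11.5 : today-ish workforce: a big speed-up from more labour
print(progress_per_week(1e10, E))   # ~18.4 : a further million-fold labour increase buys <2x the pace
```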
Ultimately, I don’t really buy this argument. If you applied this functional form to other areas of science you’d get the implication that there’s no point scaling up R&D past a certain point, which has never happened in practice. And even so, I think the functional form underestimates how much you could improve experiment quality and how much you could speed up experiments. And you have to cherry-pick the constant so that we get a big benefit from going from the slow aliens to OAI-today, but limited benefit from going from today to ASI.
Still, this kind of serial-experiment bottleneck will apply to some extent, so it seems worth highlighting that it isn’t affected by the main counterargument you made.
I don’t think my treatment of initial conditions was confused
I think your discussion (and Epoch’s discussion) of the CES model is confused as you aren’t taking into account the possibility that we’re already bottlenecking on compute or labor...
In particular, consider a hypothetical alternative world where they have the same amount of compute, but there is only 1 person (Bob) working on AI and this 1 person is as capable as the median AI company employee and also thinks 10x slower. In this alternative world they could also say “Aha, you see, because ρ = −0.4, even if we had billions of superintelligences running billions of times faster than Bob, AI progress would only go up to around 4x faster!”
Of course, this view is absurd because we’re clearly operating >>4x faster than Bob.
So, you need to make some assumptions about the initial conditions....
Consider another hypothetical world where the only compute they have is some guy with an abacus, but AI companies have the same employees they do now. In this alternative world, you could also have just as easily said “Aha, you see, because ρ = −0.4, even if we had GPUs that could do 1e15 FLOP/s (far faster than our current rate of 1e-1 fp8 FLOP/s), AI progress would only go around 4x faster!”
My discussion does assume that we’re not currently bottlenecked on either compute or labour, but I think that assumption is justified. It’s quite clear that labs want more high-quality researchers: top talent commands very high salaries, reflecting large marginal value-add. It’s also clear that researchers want more compute, again reflecting large marginal value-add. So it seems clear that we’re not strongly bottlenecked by just one of compute or labour currently. That’s why I used α=0.5, assuming that the elasticity of progress to both inputs is equal. (I don’t think this is exactly right, but it seems in the right rough ballpark.)
I think your thought experiments about the world with just one researcher and the world with just an abacus are an interesting challenge to the CES function, but they don’t imply that my treatment of initial conditions was confused.
I actually don’t find those two examples very convincing as challenges to the CES, though. In both those worlds it seems pretty plausible that the scarce input would be a hard bottleneck on progress. If all you have is an abacus, then probably the value of the marginal AI researcher would be ~0, as they’d have no compute to use. And so you couldn’t run the argument “ρ = −0.4 and so more compute won’t help much”, because in that world (unlike our world) it would be very clear that compute is a hard bottleneck to progress and cognitive labour isn’t helpful. And similarly, in the world with just Bob doing AI R&D, it’s plausible that AI companies would have ~0 willingness to pay for more compute for experiments, as Bob can’t use the compute that he’s already got; labour is the hard bottleneck. So again you couldn’t run the argument based on ρ = −0.4, because that argument only works if neither input is currently a hard bottleneck.
Can you spell out more how “Automated empirical research on process-focused evaluation methods” helps us to automate conceptual alignment research?
I get that we could get much better at understanding model psychology and how models generalise.
But why does that mean we can now automate conceptual work like “is this untestable idea a good contribution to alignment”?
Yep, I think this is a plausible suggestion. Labs can plausibly train models that are very internally useful without being helpful-only, and could fine-tune models for evals on a case-by-case basis (and delete the weights after the evals).
Agreed there’s an ultimate cap on software improvements—the worry is that it’s very far away!
It does sound like a lot—that’s 5 OOMs to reach human learning efficiency and then 8 OOMs more. But when we BOTECed the sources of algorithmic efficiency gain on top of the human brain, it seemed like you could easily get more than 8. But agreed it seems like a lot. Though we are talking about ultimate physical limits here!
Interesting re the early years. So you’d accept that learning from age 5 or 6 onwards could be OOMs more efficient, but would deny that the early years could be improved?
Though you’re not really speaking to the ‘undertrained’ point, which is about the number of params vs data points
I expect that full stack intelligence explosion could look more like “make the whole economy bigger using a bunch of AI labor” rather than specifically automating the chip production process. (That said, in practice I expect explicit focused automation of chip production to be an important part of the picture, probably the majority of the acceleration effect.) Minimally, you need to scale up energy at some point.
Agreed on the substance, we just didn’t explain this well.
You talk about the “chip technology” feedback loop as taking months, but presumably improvements to ASML take longer, as they often require building new fabs?
Agreed!
Re FLOP/joule, I also agree on the substance—we went with FLOP/joule because we wanted a clean estimate of the OOMs before reaching limits for each factor. I believe our estimate of the total OOMs to limits (including both chip tech and chip production) is right, but you’re right that there are, intuitively, ways to improve chip tech that don’t increase FLOP/joule.
I think rushing full steam ahead with AI increases human takeover risk
Here’s my own estimate for this parameter:
Once AI has automated AI R&D, will software progress become faster or slower over time? This depends on the extent to which software improvements get harder to find as software improves – the steepness of the diminishing returns.
We can ask the following crucial empirical question:
When (cumulative) cognitive research inputs double, how many times does software double?
(In growth models of a software intelligence explosion, the answer to this empirical question is a parameter called r.)
If the answer is “< 1”, then software progress will slow down over time. If the answer is “1”, software progress will remain at the same exponential rate. If the answer is “>1”, software progress will speed up over time.
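To see what these three regimes look like dynamically, here’s a toy simulation. It assumes a deliberately simple model rather than the exact growth models mentioned above: once AI R&D is automated, cumulative research input I accrues at a rate proportional to the software level S, and S = I^r.

```python
# Toy simulation of the r < 1 / r = 1 / r > 1 regimes, under simplifying assumptions
# (not the exact model in the post): once AI R&D is automated, cumulative research
# input I accrues at a rate proportional to the software level S, and S = I**r.
# Watch whether the growth rate of S falls, stays constant, or rises over time.

def simulate(r, steps=400, dt=0.01):
    I, S = 1.0, 1.0
    growth_rates = []
    for _ in range(steps):
        S_old = S
        I += S * dt          # research input accrues in proportion to the software level
        S = I ** r           # each doubling of cumulative input gives r doublings of software
        growth_rates.append((S / S_old - 1) / dt)
    return growth_rates

for r in (0.8, 1.0, 1.2):
    g = simulate(r)
    print(f"r={r}: growth rate early {g[10]:.2f} -> late {g[-1]:.2f}")
# r=0.8: growth rate falls over time   -> software progress slows
# r=1.0: growth rate stays constant    -> steady exponential progress
# r=1.2: growth rate rises over time   -> accelerating progress
```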
The bolded question can be studied empirically, by looking at how many times software has doubled each time the human researcher population has doubled.
(What does it mean for “software” to double? A simple way of thinking about this is that software doubles when you can run twice as many copies of your AI with the same compute. But software improvements don’t just improve runtime efficiency: they also improve capabilities. To incorporate these improvements, we’ll ultimately need to make some speculative assumptions about how to translate capability improvements into an equivalently-useful runtime efficiency improvement.)
The best quality data on this question is Epoch’s analysis of computer vision training efficiency. They estimate r = ~1.4: every time the researcher population doubled, training efficiency doubled 1.4 times. (Epoch’s preliminary analysis indicates that the r value for LLMs would likely be somewhat higher.) We can use this as a starting point, and then make various adjustments:
Upwards for improving capabilities. Improving training efficiency improves capabilities, as you can train a model with more “effective compute”. To quantify this effect, imagine we use a 2X training efficiency gain to train a model with twice as much “effective compute”. How many times would that double “software”? (I.e., how many doublings of runtime efficiency would have the same effect?) There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Upwards for post-training enhancements. So far, we’ve only considered pre-training improvements. But post-training enhancements like fine-tuning, scaffolding, and prompting also improve capabilities (o1 was developed using such techniques!). It’s hard to say how large an increase we’ll get from post-training enhancements. These can allow faster thinking, which could be a big factor. But there might also be strong diminishing returns to post-training enhancements holding base models fixed. I’ll estimate a 1-2X increase, and adjust my median estimate to r = ~4 (2.8*1.45=4).
Downwards for less growth in compute for experiments. Today, rising compute means we can run increasing numbers of GPT-3-sized experiments each year. This helps drive software progress. But compute won’t be growing in our scenario. That might mean that returns to additional cognitive labour diminish more steeply. On the other hand, the most important experiments are ones that use similar amounts of compute to training a SOTA model. Rising compute hasn’t actually increased the number of these experiments we can run, as rising compute increases the training compute for SOTA models. And in any case, this doesn’t affect post-training enhancements. But this still reduces my median estimate down to r = ~3. (See Eth (forthcoming) for more discussion.)
Downwards for fixed scale of hardware. In recent years, the scale of hardware available to researchers has increased massively. Researchers could invent new algorithms that only work at the new hardware scales, for which no one had previously tried to develop algorithms. Researchers may have been plucking low-hanging fruit for each new scale of hardware. But in the software intelligence explosions I’m considering, this won’t be possible because the hardware scale will be fixed. OAI estimate ImageNet efficiency via a method that accounts for this (by focussing on a fixed capability level), and find a 16-month doubling time, as compared with Epoch’s 9-month doubling time. This reduces my estimate down to r = ~1.7 (3 * 9/16).
Downwards for diminishing returns becoming steeper over time. In most fields, returns diminish more steeply than in software R&D. So perhaps software will tend to become more like the average field over time. To estimate the size of this effect, we can take our estimate that software is ~10 OOMs from physical limits (discussed below), and assume that for each OOM increase in software, r falls by a constant amount, reaching zero once physical limits are reached. If r = 1.7, then this implies that r reduces by 0.17 for each OOM. Epoch estimates that pre-training algorithmic improvements are growing by an OOM every ~2 years, which would imply a reduction in r of 1.02 (6*0.17) by 2030. But when we include post-training enhancements, the decrease will be smaller (as [reason]), perhaps ~0.5. This reduces my median estimate to r = ~1.2 (1.7 − 0.5).
Overall, my median estimate of r is 1.2. I use a log-uniform distribution with the bounds 3X higher and lower (0.4 to 3.6).
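For transparency, here’s the arithmetic of those adjustments chained together (every number is one stated above):

```python
# Chaining the adjustments above (every number is one stated in the text).
r = 1.4            # Epoch's computer-vision training-efficiency estimate
r *= 2.0           # up: capability gains count as extra software doublings    -> ~2.8
r *= 1.45          # up: post-training enhancements                            -> ~4
r *= 3.0 / 4.0     # down: less growth in compute for experiments (~4 -> ~3)
r *= 9.0 / 16.0    # down: fixed hardware scale (16- vs 9-month doubling)      -> ~1.7
r -= 0.5           # down: returns steepening over time                        -> ~1.2
print(f"median r ~ {r:.2f}")                 # ~1.21, i.e. the ~1.2 above

low, high = r / 3, r * 3                     # log-uniform band, 3x either side
print(f"bounds ~ {low:.1f} to {high:.1f}")   # ~0.4 to ~3.6
```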
I’ll paste my own estimate for this param in a different reply.
But here are the places I most differ from you:
Bigger adjustment for ‘smarter AI’. You’ve argued in your appendix that, only including ‘more efficient’ and ‘faster’ AI, you think the software-only singularity goes through. I think including ‘smarter’ AI makes a big difference. This evidence suggests that doubling training FLOP doubles output-per-FLOP 1-2 times. In addition, algorithmic improvements will improve runtime efficiency. So overall I think a doubling of algorithms yields ~two doublings of (parallel) cognitive labour.
--> software singularity more likely
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2. This will discount the observed historical increase in human workers more than it discounts the gains from algorithmic progress (because of speed improvements)
--> software singularity slightly more likely
Complications thinking about compute which might be a wash.
Number of useful experiments has increased by less than 4X/year. You say compute inputs have been increasing at 4X/year. But simultaneously the scale of experiments people must run to be near the frontier has increased by a similar amount. So the number of near-frontier experiments has not increased at all.
This argument would be right if the ‘usefulness’ of an experiment depends solely on how much compute it uses compared to training a frontier model. I.e. experiment_usefulness = log(experiment_compute / frontier_model_training_compute). The 4X/year increases the numerator and denominator of the expression, so there’s no change in usefulness-weighted experiments.
That might be false. GPT-2-sized experiments might in some ways be equally useful even as frontier model size increases. Maybe a better expression would be experiment_usefulness = alpha * log(experiment_compute / frontier_model_training_compute) + beta * log(experiment_compute). In this case, the number of usefulness-weighted experiments has increased due to the second term.
--> software singularity slightly more likely
Steeper diminishing returns during software singularity. Recent algorithmic progress has grabbed low-hanging fruit from new hardware scales. During a software-only singularity that won’t be possible. You’ll have to keep finding new improvements on the same hardware scale. Returns might diminish more quickly as a result.
--> software singularity slightly less likely
Compute share might increase as it becomes scarce. You estimate a share of 0.4 for compute, which seems reasonable. But it might fall over time as compute becomes a bottleneck. As an intuition pump, if your workers could think 1e10 times faster, you’d be fully constrained on the margin by the need for more compute: more labour wouldn’t help at all but more compute could be fully utilised so the compute share would be ~1.
--> software singularity slightly less likely
--> overall these compute adjustments prob make me more pessimistic about the software singularity, compared to your assumptions
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
One idea that seems potentially promising is to have a single centralised project and minimize the chance it becomes too powerful by minimizing its ability to take actions in the broader world.
Concretely, a ‘Pre-Training Project’ does pre-training and GCR safety assessment, post-training needed for the above activities (including post-training to make AI R&D agents and evaluating the safety of post-training techniques), and nothing else. And then have many (>5) companies that do fine-tuning, scaffolding, productising, selling API access, and use-case-specific safety assessments.
Why is this potentially the best of both worlds?
Much less concentration of power. The Pre-Training Project is strictly banned from these further activities (and indeed from any other activities) and it is closely monitored. This significantly reduces the (massive and very problematic) concentration of power you’d get from just one project selling AGI services to the world. It can’t shape the uses of the technology to its own private benefit, can’t charge monopoly prices, can’t use its superhuman AI and massive profits for political lobbying and shaping public opinion. Instead, multiple private companies will compete to ensure that the rest of the world gets maximum benefit from the tech.
More work is needed to see whether the power of the Pre-Training Project could really be robustly limited in this way.
No ‘race to the bottom’ within the west. Only one project is allowed to increase the effective compute used in pre-training. It’s not racing with other Western projects, so there is no ‘race to the bottom’. (Though obviously international racing here could still be a problem.)
I agree we should weight frontier-taste-relevant data more heavily, but I think my point still goes through.
With that weighting, human experts have much less frontier-taste-relevant data than they have total data, and it’s still true that AIs could acquire more of that data than an expert in <1 year.
As AI advances SOTA technology it will in general have more data than humans on the frontier, as AI can learn from all copies. E.g. if it takes 1000 experiments to advance the frontier one step, human experts might typically experience 10 experiments at each step, whereas AI will experience 1000.