The idea that AI can remain a mere tool seems deeply analogous to the idea that humanity can keep itself from building AI. Both are technically possible, but both fail for the same nontechnical reasons. A position that AI can’t (rather than shouldn’t) be paused but can remain a mere tool is yet more tenuous (even though magically settling there for now would be the optimal outcome, keeping what utility can be extracted safely).
That’s why I used the “no new commercial breakthroughs” clause: $300bn training systems by 2029 seem in principle possible both technically and financially without an intelligence explosion, just not with the capabilities legibly demonstrated so far. On the other hand, pre-training as we know it will end[1] in any case soon thereafter, because at ~current pace a 2034 training system would need to cost $15 trillion (it’s unclear if manufacturing can be scaled at this pace, or what to do with that much compute, because there isn’t nearly enough text data, but maybe pre-training on all the video will be important for robotics).
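For concreteness, here is the growth rate implied by those two anchors, treating the $300bn (2029) and $15 trillion (2034) figures from above as fixed points and assuming a constant annual factor in between (an illustrative assumption, not a claim about the actual funding trajectory):

```python
# Implied cost growth between the $300bn (2029) and $15tn (2034) training
# system figures above, assuming a constant annual factor (illustrative).

cost_2029 = 300e9   # $300bn training system (from the comment)
cost_2034 = 15e12   # $15 trillion training system (from the comment)
years = 2034 - 2029

annual_factor = (cost_2034 / cost_2029) ** (1 / years)
print(f"implied cost growth: {annual_factor:.2f}x per year")  # ~2.2x

cost = cost_2029
for year in range(2029, 2035):
    print(f"{year}: ${cost/1e9:,.0f}bn")
    cost *= annual_factor
```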
How far RL scales remains unclear, and even at the very first step of scaling o3 doesn’t work as clear evidence because it’s still unknown if it’s based on GPT-4o or GPT-4.5 (it’ll become clearer once there’s an API price and more apples-to-apples speed measurements).
[1] This is of course a quote from Sutskever’s talk. It was widely interpreted as saying that pre-training has just ended, in 2024-2025, but he never put a date on it. I don’t think it will end before 2027-2028.
Without an intelligence explosion, it’s around 2030 that scaling through increasing funding runs out of steam and slows down to the speed of chip improvement. This slowdown happens around the same time (maybe 2028-2034) even with a lot more commercial success (if that success precedes the slowdown), because scaling faster takes exponentially more money. So there’s more probability density of transformative advances before ~2030 than after, to the extent that scaling contributes to this probability.
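A rough way to see why the timing is insensitive to extra funding, given that scaling faster takes exponentially more money (the ~2.2x/year cost growth below is an illustrative assumption, in line with the cost figures elsewhere in these comments):

```python
import math

# If the cost of a frontier training run grows by a roughly constant factor
# each year (the 2.2x below is an illustrative assumption), then having
# k times more money only delays the point where funding runs out by
# log(k) / log(annual_factor) years.

annual_cost_factor = 2.2  # assumed annual growth in frontier training cost

for extra_funding in (2, 10, 100):
    extra_years = math.log(extra_funding) / math.log(annual_cost_factor)
    print(f"{extra_funding}x more funding -> ~{extra_years:.1f} extra years of scaling")
```

Even 10x more money to spend only buys about 3 more years of funding-driven scaling under this assumption, which is why the slowdown lands in roughly the same window either way.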
That’s my reason to see 2030 as a meaningful threshold; Thane Ruthenis might be pointing to it for different reasons. It seems like it should certainly be salient for AGI companies, so a long-timelines argument might want to address their narrative up to 2030 as a distinct case.
(Substantially edited my comment to hopefully make the point clearer.)
Risk of gradual disempowerment (erosion of control) or short-term complete extinction from AI may sound sci-fi if one hasn’t spent years taking the idea seriously, but it won’t be solved using actually sci-fi methods that have no prospect of becoming reality. It’s not the consequence that makes a problem important, it’s that you have a reasonable attack.
There needs to be a sketch of how any of this can actually be done, and I don’t mean the technical side. On the technical side you can just avoid building AI until you really know what you are doing; it’s not a problem with any technical difficulty. But the way human society works doesn’t allow this to be a feasible plan in today’s world.
It won’t go from genius-level to supergenius to superhuman (at general problem-solving or specific domains) overnight. It could take years to make progress in a more human-like style.
But the AI speed advantage? At 100x-1000x faster, years become days to weeks. Compute for experiments is plausibly a bottleneck that makes it take longer, but at genius human level, decades of human theory and software development progress (things not bottlenecked on experiments) will be made by AIs in months. That should help a lot in making years of physical time unlikely to be necessary to unlock more compute efficient and scalable ways of creating smarter AIs.
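The basic arithmetic, using the speedup figures above and ignoring compute bottlenecks:

```python
# Serial progress that would take humans a given number of years, compressed
# by a 100x-1000x speedup (figures from the comment; compute bottlenecks
# and experiment latency are ignored here).

for human_years in (1, 3, 10):
    for speedup in (100, 1000):
        days = human_years * 365 / speedup
        print(f"{human_years} human-years at {speedup}x -> {days:.1f} days")
```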
Cyberattacks can’t disable anything with any reliability or for more than days to weeks though, and there are dozens of major datacenter campuses from multiple somewhat independent vendors. Hypothetical AI-developed attacks might change that, but then there will also be AI-developed information security, adapting to any known kinds of attacks and stopping them from being effective shortly after. So the MAD analogy seems tenuous: the effect size (of this particular kind of intervention) is much smaller, to the extent that it seems misleading to even mention cyberattacks in this role/context.
I’m not sure a raw compute (as opposed to effective compute) GPT-6 (10,000x GPT-4) by 2029 is plausible (without new commercial breakthroughs). Nvidia Rubin is 2026-2027 (models trained on it 2027-2029), so a 2029 model plausibly uses the next architecture after that (though it’s then more likely to come out in early 2030, not 2029). Let’s say it’s 1e16 FLOP/s per chip (BF16, 4x B200) with time cost $4/hour (2x H100); that’s $55bn to train for 2e29 FLOPs and 3M chips in the training system if it needs 6 months at 40% utilization (reinforcing the point that 2030 is a more plausible timing, since 3M chips is a lot to manufacture). Training systems with H100s cost $50K per chip all-in to build (~BOM, not TCO), so assuming it’s 2x more for the after-Rubin chips, the training system costs $300bn to build. Also, a Blackwell chip needs 2 KW all-in (a per-chip fraction of the whole datacenter), so the after-Rubin chip might need 4 KW, and 3M chips need 12 GW.
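Spelling out this arithmetic (all per-chip figures are the assumptions stated above, not known specs of any announced chip):

```python
# Worked version of the arithmetic above for a hypothetical 2e29 FLOPs
# (raw BF16) training run on a post-Rubin chip. All per-chip figures are
# the assumptions stated in the comment, not known specs.

target_flops  = 2e29    # raw training compute for "GPT-6"
chip_flops    = 1e16    # FLOP/s per chip (BF16), assumed ~4x B200
utilization   = 0.40
hour_cost     = 4.0     # $/chip-hour, assumed ~2x H100
build_cost    = 100e3   # $ per chip all-in (~BOM), assumed ~2x the H100-era $50K
chip_power_kw = 4.0     # kW all-in per chip, assumed ~2x Blackwell's 2 KW
months        = 6

chip_seconds = target_flops / (chip_flops * utilization)
chip_hours   = chip_seconds / 3600
train_cost   = chip_hours * hour_cost
num_chips    = chip_seconds / (months * 30 * 24 * 3600)
system_cost  = num_chips * build_cost
power_gw     = num_chips * chip_power_kw / 1e6

print(f"training time cost: ${train_cost/1e9:.0f}bn")            # ~$55bn
print(f"chips for a {months}-month run: {num_chips/1e6:.1f}M")    # ~3M
print(f"training system build cost: ${system_cost/1e9:.0f}bn")    # ~$300bn
print(f"power: {power_gw:.0f} GW")                                # ~12-13 GW
```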
These numbers need to match the scale of the largest AI companies. A training system ($300bn in capital, 3M of the newest chips) needs to be concentrated in the hands of a single company, probably purpose-built. And then at least $55bn of its time needs to be spent (the rest can go on the inference market after). But in practice almost certainly many times more will be spent (using earlier training systems) on experiments that make the final training run possible, so the savings from using the training system for inference after the initial large training run don’t really reduce the total cost of the model. The AI company would still need to find approximately the same $300bn to get it done.
The largest spend of similar character is Amazon’s 2025 capex of $100bn. A cloud provider builds datacenters, while an AI company might primarily train models instead. So an AI company wouldn’t necessarily need to worry about things other than the $300bn dedicated to the frontier model project, while a cloud provider would still need to build inference capacity.
Key constraints are memory for storing KV-caches, scale-up world size (a smaller collection of chips networked at much higher bandwidth than outside such collections), and the number of concurrent requests. A model needs to be spread across many chips to fit in memory, leave enough space for KV-caches, and run faster. If there aren’t enough requests for inference, all these chips will be mostly idle, but the API provider will still need to pay for their time. If the model is instead placed on fewer chips, it won’t be able to process many requests at once (the chips will run out of memory for KV-caches), and each request will also be processed more slowly.
So there is a minimum threshold in the number of users needed to serve a model at a given speed and low cost. GB200 NVL72 is going to change this a lot, since it ups scale-up world size from 8 to 72 GPUs, and a B200 chip has 192 GB of HBM to H100’s 80 GB (though H200s have 141 GB). This makes it possible to fit the same model on fewer chips while maintaining high speed (using fewer scale-up worlds) and to process many concurrent requests (having enough memory for many KV-caches), so the inference prices for larger models will probably collapse and Hoppers will become more useful for training experiments than inference (other than for the smallest models). It’s a greater change than between A100s and H100s, since both had 8 GPUs per scale-up world.
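To make the memory constraint concrete, here is a rough KV-cache accounting sketch for a hypothetical dense model served on a single scale-up world; all the model dimensions and the context length below are made-up illustrative values, not the specs of any real model or deployment:

```python
# Rough KV-cache accounting for serving a hypothetical model; every number
# below is an illustrative assumption, not a spec of any real deployment.

# Hypothetical model
n_layers     = 120
n_kv_heads   = 16          # grouped-query attention
head_dim     = 128
bytes_per_el = 2           # bf16 KV cache
params       = 400e9       # dense parameter count
weight_bytes = params * 2  # bf16 weights

# KV cache per token: keys + values across all layers
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el
print(f"KV cache per token: {kv_bytes_per_token/1e6:.2f} MB")

# Serving hardware: one scale-up world
gpus        = 72           # an NVL72-sized domain, vs 8 for H100
hbm_per_gpu = 192e9        # bytes of HBM per GPU
total_hbm   = gpus * hbm_per_gpu

context_len = 32_000       # tokens of KV cache held per concurrent request
free_for_kv = total_hbm - weight_bytes
concurrent  = free_for_kv / (kv_bytes_per_token * context_len)
print(f"memory left for KV caches: {free_for_kv/1e12:.1f} TB")
print(f"~{concurrent:.0f} concurrent {context_len}-token requests fit")
```

For comparison, an 8-GPU H100 scale-up world has 8 x 80 GB = 640 GB, which wouldn’t even hold the bf16 weights of this hypothetical model, let alone the KV caches, which is the sense in which larger scale-up worlds and more HBM change the serving economics.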
I think most of the trouble is conflating recent models like GPT-4o with GPT-4, when they are instead ~GPT-4.25. It’s plausible that some already use 4x-5x compute of original GPT-4 (an H100 produces 3x compute of an A100), and that GPT-4.5 uses merely 3x-4x more compute than any of them. The distance between them and GPT-4.5 in raw compute might be quite small.
It shouldn’t be at all difficult to find examples where GPT-4.5 is better than the actual original GPT-4 of March 2023; it’s not going to be subtle. Before ChatGPT there were very few well-known models at each scale, but now the gaps are all filled in by numerous models of intermediate capability. It’s the sorites paradox, not yet evidence of slowdown.
Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans are initially doing them and AI needs to catch up. Instead it’s AI that starts doing all the new things well first and humans get no opportunity to become competitive at anything, old or new, ever again.
Even formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. This is not even deceptive alignment, but it still plausibly segues into gradual disempowerment or a sharp left turn.
Agency (proficient tool use with high reliability for costly actions) might be sufficient to maintain an RSI (recursive self-improvement) loop (as an engineer, using settled methodology) even while lacking crucial capabilities (such as coming up with important novel ideas), eventually developing those capabilities without any human input. But even if it works like this, the AI speed advantage might be negated by the lack of those capabilities, so that human-led AI research is still faster and mostly bottlenecked by availability and cost of compute.
1e26 FLOP would have had a significant opportunity cost.
At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn’t seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It’s difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would’ve been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a less overtrained model that costs more in inference, and so requires more GPUs to serve for inference).
I don’t think it’s a good idea to reason backwards from alleging some compute budget that OpenAI might have had at X date, to inferring the training FLOP of a model trained then.
The main anchors are 32K H100s in a single training system, and frontier training compute scaling 4x per year. Currently, a year later, 3e26-6e26 FLOPs models are getting released (based on 100K H100s in Colossus and numbers in the Grok 3 announcement, 100K H100s at Goodyear site, 100K TPUv6e datacenters, Meta’s 128K H100s). The $3bn figure was just to point out that $140m following from such anchors is not a very large number.
65T tokens doesn’t get you to 1e26 FLOP with 100B active params?
Right, 45T-65T is for a compute optimal 1e26 model; I did the wrong calculation when editing in this detail. For a 10x overtrained model, it’s 3x more data than that, so for 150T total tokens you’d need 5 epochs of 30T tokens, which is still feasible (with almost no degradation compared to 150T unique tokens of that quality). The aim was to calculate this from 260B and 370B reduced 3x (rather than from 100B).
GPT-4.5 being trained on fewer tokens than GPT-4o doesn’t really make sense.
How so? If it uses 3x more compute but isn’t 10x overtrained, that means less data (with multiple epochs, it would probably use exactly the same unique data, repeated a bit less). The video presentation on GPT-4.5 mentioned work on lower precision in pretraining, so it might even be a 6e26 FLOPs model (though a priori it would be surprising if the first foray into this scale isn’t taken at the more conservative BF16). And it would still be less data (the square root of 6x is less than 3x). Overtraining has a large effect on both the number of active parameters and the needed number of tokens, at a relatively minor cost in effective compute, so it’s a very salient option for production models.
There is a report that OpenAI might’ve been intending to spend $3bn on training in 2024 (presumably mostly for many smaller research experiments), and a claim that the Goodyear site has 3 buildings hosting 100K H100s. One of these buildings is 32K H100s, which at 40% utilization in 3 months produces 1e26 FLOPs (in BF16), which in GPU-time at $2/hour costs $140m. So it seems plausible that Azure already had one of these (or identical) datacenter buildings when GPT-4o was ready to train, and that $140m wasn’t too much for a flagship model that carries the brand for another year.
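Spelling out that arithmetic, using the H100’s dense BF16 throughput (roughly 1e15 FLOP/s) and the utilization, duration, and $/hour figures stated above:

```python
# The arithmetic behind the 1e26 FLOPs / $140m figures above. The H100 BF16
# throughput is the dense spec (~989 TFLOP/s, rounded to 1e15); utilization,
# duration, and $/hour are the values stated in the comment.

gpus        = 32_000
bf16_flops  = 1e15        # FLOP/s per H100 (dense BF16), rounded
utilization = 0.40
months      = 3
hour_cost   = 2.0         # $/GPU-hour

seconds   = months * 30 * 24 * 3600
flops     = gpus * bf16_flops * utilization * seconds
gpu_hours = gpus * seconds / 3600
cost      = gpu_hours * hour_cost

print(f"raw training compute: {flops:.1e} FLOPs")  # ~1e26
print(f"GPU-time cost: ${cost/1e6:.0f}m")          # ~$140m
```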
With this amount of compute and the price of $2.5 per 1M input tokens, it’s unlikely to be compute optimal. For MoEs at 1e26 FLOPs, it might be compute optimal to have 120-240 tokens/parameter (for 1:8-1:32 sparsity), which is 370B active parameters for a 1:8 sparse MoE or 260B for a 1:32 sparse MoE. Dense Llama-3-405B was $5 per 1M input tokens at probably slimmer margins, so GPT-4o needs to be more like 100B active parameters. Thus 3x fewer parameters than optimal and 3x more data than optimal (about 135T-190T trained-on tokens, which is reasonable as 5 epochs of 25T-40T unique tokens), giving 10x overtraining in the value of tokens/parameter compared to compute optimal. The penalty from 10x overtraining is a compute multiplier of about 0.5x, so a 5e25 FLOPs compute optimal model would have similar performance, but it would have 2x more active parameters than a 10x overtrained 1e26 FLOPs model, so the $70m difference in training cost should more than pay for itself in cheaper inference.
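The parameter/token arithmetic here can be made explicit with the standard C ≈ 6·N·D approximation for training FLOPs; the tokens/parameter targets and the 10x overtraining factor are the values discussed in these comments, not measured facts:

```python
# Sketch of the parameter/token arithmetic above, using the standard
# C ~= 6 * N * D approximation (C: training FLOPs, N: active params,
# D: trained-on tokens). Tokens/param targets and the 10x overtraining
# factor are the values discussed in the comments.

C = 1e26  # training FLOPs

# Compute optimal points for MoE at this scale (values from the comment)
for sparsity, tokens_per_param in (("1:8", 120), ("1:32", 240)):
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    N = (C / (6 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    print(f"{sparsity} MoE, compute optimal: N ~ {N/1e9:.0f}B params, D ~ {D/1e12:.0f}T tokens")

# 10x overtrained model at the same compute: ~3x fewer params, ~3x more tokens
N_overtrained = 100e9
D_overtrained = C / (6 * N_overtrained)
print(f"10x overtrained: N ~ {N_overtrained/1e9:.0f}B params, D ~ {D_overtrained/1e12:.0f}T tokens")
```

At 5 epochs, the ~167T trained-on tokens this gives correspond to ~33T unique tokens, within the 25T-40T range above.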
Full Grok 3 only had a month for post-training, and keeping responses on general topics reasonable is a fiddly semi-manual process. They didn’t necessarily have the R1-Zero idea either, which might make long reasoning easier to scale automatically (as long as you have enough verifiable tasks, which is the thing that plausibly fails to scale very far).
Also, running long reasoning traces for a big model is more expensive and takes longer, so the default settings will tend to give smaller reasoning models more tokens to reason with, skewing the comparison.
My point is that a bit of scaling (like 3x) doesn’t matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing impacting capabilities that will predictably happen soon. And it’s going to arrive a little bit at a time, so it won’t be obviously impactful at any particular step, doing nothing to disrupt the rumors that scaling is no longer important. It’s a rising sea kind of thing (if you have the compute).
Long reasoning traces were always going to start working at some point, and the s1 paper illustrates that we don’t really have evidence yet that R1-like training creates rather than elicits nontrivial capabilities (things that wouldn’t be possible to transfer in a mere 1000 traces). Amodei is suggesting that RL training can be scaled to billions of dollars, but it’s unclear if this assumes that AIs will automate creation of verifiable tasks. If constructing such tasks (or very good reward models) is the bottleneck, this direction of scaling can’t quickly get very far outside specialized domains like chess, where a single verifiable task (winning a game) generates endless data.
The quality data wall and flatlining benchmarks (with base model scaling) are about compute multipliers that depend on good data but don’t scale very far. As opposed to scalable multipliers like high sparsity MoE. So I think these recent 4x a year compute multipliers mostly won’t work above 1e27-1e28 FLOPs, which superficially looks bad for scaling of pretraining, but won’t impact the less legible aspects of scaling token prediction (measured in perplexity on non-benchmark data) that are more important for general intelligence. There’s also the hard data wall of literally running out of text data, but being less stringent on data quality and training for multiple epochs (giving up the ephemeral compute multipliers from data quality) should keep it at bay for now.
my intuitions have been shaped by events like the pretraining slowdown
I don’t see it. GPT-4.5 is much better than the original GPT-4, probably at 15x more compute. But it’s not 100x more compute. And GPT-4o is an intermediate point, so the change from GPT-4o to GPT-4.5 is even smaller, maybe 4x.
I think a 3x change in compute has an effect at the level of noise from different reasonable choices in constructing a model, and 100K H100s is only 5x more than the 20K H100s of 2023. It’s not a slowdown relative to what it should’ve been. And models with 200x more raw compute than went into GPT-4.5 are probably coming in 2027-2029, much more than the 4x-15x observed since 2022-2023.
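For reference, the way these estimates fit together (all inputs are the estimates from these comments, not known training details):

```python
# Rough consistency check of the compute ratios estimated in these comments
# (all inputs are the estimates above, not known training details).

gpt4o_vs_gpt4  = (4, 5)   # recent ~GPT-4.25 models at 4x-5x original GPT-4
gpt45_vs_gpt4o = (3, 4)   # GPT-4.5 at 3x-4x more compute than those

low  = gpt4o_vs_gpt4[0] * gpt45_vs_gpt4o[0]
high = gpt4o_vs_gpt4[1] * gpt45_vs_gpt4o[1]
print(f"GPT-4.5 vs original GPT-4: ~{low}x-{high}x raw compute")  # consistent with "probably at 15x"
```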
Deep Research … will rapidly improve—when GPT-4.5 arrives soon and is integrated into the underlying reasoning model
Deep Research is based on o3, but it’s unclear if o3 is based on GPT-4o or GPT-4.5. The knowledge cutoff for GPT-4.5 is Oct 2023 and the announcement about training the next frontier model was in May 2024, so it plausibly finished pretraining by Sep-Oct 2024, in time to use as a foundation for o3.
It might still rapidly improve even if already based on GPT-4.5, if RL training is scalable, but that also remains unknown: the reasoning models so far don’t come with scaling laws for RL training, and it’s plausible that this is bottlenecked on manual construction of verifiable tasks, which can’t be scaled 1000x.
Sane pausing similarly must be temporary, gated by theory and the experiments it endorses. Pausing is easier to pull off than persistently-tool AI, since it’s further from dangerous capabilities, so it’s not nearly as ambiguous when you take steps outside the current regime (such as gradual disempowerment). RSPs, for example, are the strategy of being extremely precise so that you stop just before the risk of falling off the cliff becomes catastrophic, and not a second earlier.