Recursive self-improvement in AI probably comes before AGI. Evolution doesn’t need to understand human minds to build them, and a parent doesn’t need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn’t depend on understanding how they think.
Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven, mundane way that doesn’t depend on matching Grothendieck’s capacity for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows from it, we plausibly already have all it takes to get recursive self-improvement, provided the current designs get there with the next few years of scaling, even if the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
The bitter lesson says that there are many things you don’t need to understand, but it doesn’t say you don’t need to understand anything.
I think you’re doing a “we just need X” with recursive self-improvement. The improvement may be iterable and self-applicable… but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded… The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.
For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won’t block them all).
Another prediction we might make is that seeing some rapid progress doesn’t guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy.
Technically this probably isn’t recursive self-improvement, but rather automated AI progress. This is relevant mostly because:
It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
It means that multi-agent dynamics will be very relevant in how things happen.
If your threat model is “no group of humans manages to gain control of the future before human irrelevance”, none of this probably matters.
No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence without crossing the threshold of AGI being any more useful in helping them gain control over this process than humans manage to maintain such control at the outset. So it’s not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability, where a superintelligence can finally take control of this process.
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I’m unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.
Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn’t depend on such capabilities, and doesn’t stall for their lack, just as evolution doesn’t stall for lack of any mind at all. If there is more potential for making models into smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.
Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only a vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates the engineering capabilities of AIs. I’m still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.
(It’s useful to frame most ideas as exploratory engineering rather than forecasting. The question of whether something can happen, or can be done, doesn’t need to be contextualized within the question of whether it will happen or will be done. Physical experiments are done under highly contrived conditions, and similarly we can conduct thought experiments or conceptual arguments under fantastical or even physically impossible conditions. Thus I think Carl Shulman’s human level AGI world is a valid exploration of the future of AI, even though I don’t believe that most of what he describes happens in actuality before superintelligence changes the premise. It serves as a strong argument for industrial and economic growth driven by AGI, even though it almost entirely consists of describing events that can’t possibly happen.)
Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren’t required, it’s mostly an engineering task of scaling compute.
This seems like the sort of R&D that China is good at: research that doesn’t need superstar researchers and that is mostly made of incremental improvements. And yet they don’t seem to be producing top LLMs. Why is that?
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs.
A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently. There is also significantly less Chinese text data. I am not a China or tech expert, so these are just guesses.
In any case, I wouldn’t assign it too much significance. The AI space is just moving so quickly that even a one-year delay can seem like light years. But that doesn’t mean that Chinese companies can’t do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can’t scale up a transformer.
@gwern
New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).
So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.
SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That’s enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s.
Anthropic’s post: “This cluster will deliver more than five times the computing power used to train our current generation of leading AI models.”
At 4 months, with $2/hour, this takes $300 million, which is at odds with the $100 million Dario Amodei gestured at in Jun 2024; but that figure only applies to Claude 3.5 Sonnet, not Opus. So Opus 3.5 (if it does come out) might be a 2e26 FLOPs model, while Sonnet 3.5 is a 7e25-1e26 FLOPs model. On the other hand, $2 per H100-hour is not AWS pricing; at actual AWS prices, Sonnet 3.5 might be capped at 4e25 FLOPs, same as Llama-3-405B.
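For concreteness, a minimal Python sketch of the arithmetic above; the chip FLOP/s figures, the SemiAnalysis power estimate, 40% utilization, and $2 per H100-hour are taken as stated here, not independently verified.

```python
# Rough sketch of the arithmetic above. Chip FLOP/s, the SemiAnalysis power
# figure, 40% utilization, and $2/H100-hour are taken as stated in the text.
H100_FLOPS = 1e15            # dense BF16 FLOP/s
TRN2_FLOPS = 0.65e15         # dense BF16 FLOP/s
UTILIZATION = 0.4
SECONDS_PER_MONTH = 30 * 24 * 3600

# Power: ~25 kW per 32 Trn2 chips
kw_per_trn2 = 25 / 32
print(f"200K Trn2: {200_000 * kw_per_trn2 / 1000:.0f} MW")                     # ~156 MW
print(f"7 x 65 MW buildings fit ~{7 * 65_000 / kw_per_trn2 / 1e3:.0f}K Trn2")  # ~580K

# Compute equivalence (the text rounds this to 250K H100s)
print(f"400K Trn2 ~ {400_000 * TRN2_FLOPS / H100_FLOPS / 1e3:.0f}K H100s")

# Training FLOPs on the previous ~50K-H100 compute
for months in (2, 4):
    flops = 50_000 * H100_FLOPS * UTILIZATION * months * SECONDS_PER_MONTH
    print(f"{months} months on 50K H100s: {flops:.1e} FLOPs")                  # 1e26-2e26

# Cost at $2 per H100-hour, and what ~$100M would buy at that price
print(f"4 months on 50K H100s: ${50_000 * 4 * 720 * 2 / 1e6:.0f}M")            # ~$290M
h100_hours = 100e6 / 2
print(f"$100M buys ~{h100_hours * 3600 * H100_FLOPS * UTILIZATION:.0e} FLOPs") # ~7e25
```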
Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.
For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel claims are 48 megawatts each and filled with H100s, for about 100K H100s. This probably came online around May 2024, and is likely the reason for the announcement and the referent of Kevin Scott’s blue whale slide.
There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but deliveries of B200s in high volume to any given customer might only start in early to mid 2025, so these systems will probably only come online towards the end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months of lead there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it’s plausibly merely 4e25 FLOPs based on Dario Amodei’s (somewhat oblique) claim about cost, additionally getting a compute advantage in training a frontier model could carry them quite far.
There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd that in Sep 2024 only the first 3 had walls, so the 4th is probably not yet done.
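As a quick consistency check on these power figures, a minimal sketch using only the numbers claimed in this thread (Patel’s 150 megawatt figure is quoted further down):

```python
# Sanity check on the Phoenix figures; building count, per-building power,
# and the 100K H100s count are claims from the thread, not verified.
buildings, mw_per_building = 3, 48
total_mw = buildings * mw_per_building               # 144 MW
h100s = 100_000                                      # Dylan Patel's claim
kw_per_gpu = total_mw * 1000 / h100s
print(f"{total_mw} MW / {h100s} H100s ~ {kw_per_gpu:.2f} kW per H100 all-in")
# ~1.4 kW per GPU including CPUs, networking and cooling overhead, which is
# close to Patel's separate "100K H100s at 150 megawatts" figure.
```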
Thanks Vladimir, this is really interesting!
Re: OpenAI’s compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would’ve guessed they’d be on track to be using around 400k H100s by now.
So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible?
The co-location of the Trainium2 cluster might give Anthropic a short-term advantage, though I think it’s actually quite unclear whether their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively.
$6e9 / 365.25d / 24h / $2.5/hr = 274k
Training as it’s currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn’t training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system.
So I expect the big expenses are inference and many training experiments with smaller models. What I’m discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment.
Patel’s claim is 100K H100s at 150 megawatts.
Training as it’s currently done needs to happen within a single cluster

I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in the technical report:

TPUv4 accelerators are deployed in “SuperPods” of 4096 chips... TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet).
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale.
The “cluster” label applied in this context might be a bit of a stretch. For example, the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs; the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).
Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn’t matter, only bandwidth. I’m not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that’s 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that’s our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B’s 6 seconds down to less than that, making the necessary gradient communication bandwidth higher.
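A minimal sketch of the two calculations above, with the 4-bytes-per-weight gradients and the ~1 second exchange window as the same assumptions used in the text:

```python
# Minibatch size and gradient-sync bandwidth for Llama 3 405B,
# using the figures from the text.
params = 405e9
gpus = 16_000
gpus_per_sequence = 8            # one H100 NVLink domain per pipeline stage
seq_len = 8192
total_tokens = 16e12

sequences_per_step = gpus / gpus_per_sequence           # ~2K sequences
tokens_per_minibatch = sequences_per_step * seq_len     # ~16M tokens
steps = total_tokens / tokens_per_minibatch             # ~1M optimizer steps
print(f"{tokens_per_minibatch / 1e6:.0f}M tokens/minibatch, {steps / 1e6:.0f}M steps")

grad_bytes = params * 4                                 # 4 bytes/weight (assumed above)
# Send gradients each way in ~1 second of the ~6-second step:
print(f"~{grad_bytes * 8 / 1e12:.0f} Tbps to move {grad_bytes / 1e12:.1f} TB in 1 s")
```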
Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And with more weights, there’ll be more data in the gradients for those models. Llama 3 405B has 16Kx53K matrices and 8K token sequences, so at 3TB/s and 1e15 FLOP/s (in an H100), you need tiles of size at least 1000x1000 to get sufficient arithmetic intensity. The scaleup network is a bit over 3 times slower than HBM, which is almost sufficient to move along the results (and starts to fit if we increase the inner dimension, with the tiles no longer square). So as far as I understand (I could be very wrong, without experience to anchor the numbers), in principle there is enough there for a bit less than 8 times 16 times 53 GPUs to work with (tiling the multiplication of a 16Kx53K matrix by a 53Kx8K matrix in 1Kx1K squares). More than 1000 such GPUs could participate in tensor parallelism for Llama 3 405B if the network could handle it, so in particular the 72 GPUs of NVL72 are few enough that they could run such multiplications with tensor parallelism.
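One way to back out the ~1000x1000 tile figure is a roofline-style bound, assuming BF16 inputs and counting only the traffic for the two input tile strips; the accounting in the paragraph above may differ in its details.

```python
# Roofline-style way to back out the ~1000x1000 tile figure (BF16, 2 bytes
# per element, counting only the loads of the two input tile strips).
peak_flops = 1e15        # H100 dense BF16
hbm_bw = 3e12            # bytes/s
bytes_per_elem = 2

# For a t x t output tile with inner dimension K:
#   compute ~ 2 * t * t * K FLOPs, traffic ~ 2 * t * K * bytes_per_elem bytes,
#   so arithmetic intensity ~ t / bytes_per_elem FLOP/byte.
required_intensity = peak_flops / hbm_bw                 # ~333 FLOP/byte
t_min = required_intensity * bytes_per_elem              # ~667
print(f"t >= ~{t_min:.0f}, so ~1000x1000 tiles give some margin")
```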
With 72 B200s per NVLink domain in a 500K B200s system, that’s 7K sequences per minibatch, 3x more than for Llama 3 405B[3]. The compute per second, and so per training run, is larger than with 16K H100s by a factor of 80, so by Chinchilla scaling law a dense model would be about 9 times larger, 3.5T parameters. So the model is 9x larger, processed over 9x more GPUs (per NVLink domain) that are 2.5 times faster, which means an optimizer step is 2.5 times shorter. This assumes that the sequence length stays 8K (if it’s higher then so is the time between optimizer steps, reducing the necessary bandwidth). Transmitting gradients for 9x more weights in that time requires bandwidth that’s 20 times higher, about 300 Tbps.
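A sketch of this scaling estimate, taking the 2.5x per-GPU speedup of B200 over H100 and the Chinchilla square-root scaling as stated above:

```python
# Sketch of the 500K-B200 estimate; per-GPU speedup and Chinchilla-style
# sqrt scaling are taken as stated in the text.
h100_flops, b200_flops = 1e15, 2.5e15        # "2.5 times faster"
llama_gpus, new_gpus = 16_000, 500_000
nvlink_domain = 72

compute_ratio = new_gpus * b200_flops / (llama_gpus * h100_flops)   # ~80
param_ratio = compute_ratio ** 0.5                                  # ~9
print(f"compute x{compute_ratio:.0f}, params x{param_ratio:.1f} -> ~{0.405 * param_ratio:.1f}T")

print(f"{new_gpus / nvlink_domain / 1e3:.0f}K sequences per minibatch")        # ~7K

# Gradients grow ~9x while the step gets ~2.5x shorter, so bandwidth grows ~22x.
print(f"~{13 * param_ratio * 2.5:.0f} Tbps (vs ~13 Tbps for Llama 3 405B)")

# Same compute as ~1200K H100s with 8-GPU NVLink domains would instead need:
print(f"{1_200_000 / 8 / 1e3:.0f}K sequences per minibatch")
```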
That’s still within the realm of possibility: some ocean-floor cables feature bandwidth on the same order of magnitude, and overland cables should enable more. But it’s no longer likely to be trivial, and could require actually laying cables between the datacenter campus sites, which could take a long time to get all the permissions and to do the construction.
16K GPUs at 40% utilization for about 4e25 dense BF16 FLOPs, which is 40% of 1e15 FLOP/s for each GPU. And 16M tokens/minibatch (Table 4) out of about 16T tokens in total.
This gives another way of getting the estimate of 6 seconds per step, which doesn’t depend on the size of the cluster at all. The compute for 1 sequence is 6 times 405B parameters times 8K tokens, processed by 8 GPUs (at some pipeline parallelism stage), each at a rate of 1e15 FLOP/s with 40% utilization on average, so it takes them 6 seconds to process a sequence.
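The same footnote arithmetic as a one-liner, assuming 1e15 dense BF16 FLOP/s per H100 at 40% utilization:

```python
# The footnote's step-time arithmetic in numbers.
params, seq_len = 405e9, 8192
gpus_per_sequence, h100_flops, util = 8, 1e15, 0.4
seconds = 6 * params * seq_len / (gpus_per_sequence * h100_flops * util)
print(f"~{seconds:.1f} s per optimizer step")    # ~6.2 s
```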
So making NVLink domains 9x larger only kept the problem of large minibatches from getting more than 3 times worse. This is still much better than 150K sequences per minibatch if the same compute was assembled in the form of 1200K H100s with 8 GPUs per NVLink domain.
And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn’t doing image generation, it isn’t doing voice synthesis, it isn’t doing video generation… (As far as we know they aren’t researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That’s it.
But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to ‘concise’ responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to ‘overwrite’, especially in what I was working on.)
I’d guess that the amount spent on image and voice is negligible for this BOTEC?
I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn’t that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?
Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It’s above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It’s above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). It would be interesting to know whether this is mostly better methodology or compute scaling getting taken more seriously for a tiny model.
The developer’s site says it’s a MoE model. Developer’s API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that’s about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won’t have significant margins, and model size is known, unlike with lightweight closed models.)
Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn:

Yi-Lightning is a small MOE model that is extremely fast and inexpensive. Yi-Lightning costs only $0.14 (RMB0.99) /mil tokens [...] Yi-Lightning was pre-trained on 2000 H100s for 1 month, costing about $3 million, a tiny fraction of Grok-2.
Assuming it’s trained in BF16 with 40% compute utilization, that’s a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it’s not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it’s trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
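A sketch under the stated assumptions, with the 10-20B active parameters inferred from the price treated as a given:

```python
# Sketch of the Yi-Lightning estimate under the stated assumptions.
h100_flops, util = 1e15, 0.4
gpus, seconds = 2000, 30 * 24 * 3600             # Kai-Fu Lee: 2000 H100s, 1 month

train_flops = gpus * h100_flops * util * seconds
print(f"~{train_flops:.0e} FLOPs")               # ~2e24

# Chinchilla-style accounting C ~ 6 * N * D, so D ~ C / (6 N)
for active_params in (10e9, 20e9):
    tokens = train_flops / (6 * active_params)
    print(f"{active_params / 1e9:.0f}B active -> ~{tokens / 1e12:.0f}T tokens")
# ~17-35T tokens, the same ballpark as the 15-30T estimate above
```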