Elon is building a massive 1 million GPU data center in Tennessee. Tens of billions of dollars. He intends to leapfrog competitors.
EA handwringing about Sam Altman & Anthropic-stanning suddenly pretty silly?
I don’t understand how the second sentence follows from the first?
In EA there is a lot of chatter about OpenAI being evil and about why you should do this coding bootcamp to work at Anthropic. However, there are a number of other competitors in the race to AGI, not least of which is Elon Musk. Since there is little meaningful moat beyond scale (and the government is likely to be involved soon), all the focus on the minutiae of OpenAI & Anthropic may very well end up misplaced.
This doesn’t follow. The fact that OpenAI and Anthropic are racing contributes to other people like Musk deciding to race, too. This development just means that there’s one more company to criticize.
The concrete news is a new $6 billion round, which enables xAI to follow through on the intention to add another 100K H100s (or a mix of H100s and H200s) to the existing 100K H100s. The timeline for a million GPUs remains unknown (and the means of powering them at that facility even more so).
Going fast with 1M H100s might be a bad idea if the problem I hypothesize with large minibatch sizes is real: that very large minibatch sizes are both quite harmful and hard to avoid in practice when sticking with that many H100s. (This could even be the reason for the underwhelming scaling outcomes of the current wave of scaling, if those are real too, though not for Google.)
Aiming for 1M B200s only roughly doubles or triples Microsoft's planned 300K-700K B200s, so it's not a decisive advantage, and it's even less meaningful without a timeline (at some point Microsoft could be doubling or tripling its training compute as well).
For the next few months Anthropic might have the compute lead (over OpenAI, Meta, and xAI; Google is harder to guess). And if the Rainier cluster uses Trn2 Ultra rather than regular Trn2, there won't even be a minibatch size problem there (if the problem is real): unlike H100s, which form 8-GPU scale-up domains, the Trn2 Ultra machines have 64-chip scale-up domains, for about 41 units of H100-equivalent compute per scale-up domain.
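For a rough sense of where that 41 comes from (assuming roughly 1.3 PFLOP/s of dense FP8 compute per Trainium2 chip and roughly 2 PFLOP/s per H100; these specs are my assumptions, not exact figures): 64 × 1.3 / 2.0 ≈ 41.6, i.e. roughly 41 H100-equivalents per Trn2 Ultra scale-up domain, versus 8 for an H100 scale-up domain.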
I mean, here are two comments I wrote three weeks ago, in a shortform about Musk being able to take action against Altman via his newfound influence in government:
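That might very well help, yes. However, two thoughts, neither at all well thought out: … Musk's own track record on AI x-risk is not great. I guess he did endorse California's SB 1047, so that's better than OpenAI's current position. But he helped found OpenAI, and recently founded another AI company. There's a scenario where we just trade extinction risk from Altman's OpenAI for extinction risk from Musk's xAI.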
And:
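I'm sympathetic to Musk being genuinely worried about AI safety. My problem is that one of his first actions after learning about AI safety was to found OpenAI, and that hasn't worked out very well. Not just due to Altman; even the "Open" part was a highly questionable goal. Hopefully Musk's future actions in this area would have positive EV, but still.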
Yes, you have a point.
I believe that building massive data centers is the biggest risk at the moment and in the near future. I don't think OpenAI/Anthropic will get to AGI; rather, someone copying biology will. In that case, the bigger the data centers around when that happens, the bigger the risk. For example, a 1-million-GPU cluster with current tech doesn't get you super AI, but once we figure out the architecture, it suddenly becomes much more capable and dangerous: going from, say, IQ 100 up to 300, with a large capability overhang. If the data center were smaller, the overhang would be smaller. The scenario I have in mind is that someone figures AGI out, and then, one way or another, the secret suddenly gets adopted by the large data center.
For that reason I believe the focus on FLOPs for training runs is misguided; it's hardware concentration and yearly worldwide hardware production capacity that matter more.