This is why people in Silicon Valley are talking about power plants.
Another concern is powering an individual datacenter campus in order to train a GPT-6 level model (going by 30x a generation, that's roughly a $10 billion / 3M H100 / 6 GW training run; a rough back-of-the-envelope is sketched below). I previously read the following passage from the Gemini 1.0 report as a clear indication that at least Google is ready to spread training compute across large distances:
Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. [...] we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
But in principle, ‘inter-cluster network’ could just refer to connecting different clusters within a single campus, which doesn’t help with the local power constraint. (This SemiAnalysis post offers some common sense on the topic.)
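For concreteness, here is where figures like 3M H100s and ~6 GW could come from. Every input below (all-in power per GPU, rental-equivalent price, run duration) is a round number I'm assuming for illustration, not a figure from the post or from any vendor:

```python
# Back-of-the-envelope for a "GPT-6 scale" training run.
# All inputs are assumed round numbers for illustration only.

n_gpus = 3_000_000            # assumed fleet: 3M H100-class accelerators
watts_per_gpu_all_in = 2_000  # assumed W per GPU incl. host, networking, cooling
dollars_per_gpu_hour = 2.0    # assumed all-in rental-equivalent $/GPU-hour
training_days = 70            # assumed wall-clock duration of the run

power_gw = n_gpus * watts_per_gpu_all_in / 1e9
cost_usd = n_gpus * dollars_per_gpu_hour * 24 * training_days

print(f"campus power draw: ~{power_gw:.0f} GW")       # ~6 GW
print(f"training run cost: ~${cost_usd / 1e9:.0f}B")  # ~$10B
```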
Failing that, there are research papers on asynchronous distributed training, but those methods haven’t been scaled far enough to count as a known way forward yet. Since this kind of thing isn’t certain to work, plans are probably being made for the eventuality that it doesn’t, which would mean directing 6 GW of power to a single datacenter campus.
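One recurring family of ideas in that literature is low-communication “local SGD”: each campus takes many optimizer steps on its own fast interconnect and only occasionally exchanges parameters over the slow inter-site links. A minimal toy sketch of that pattern, with every detail (toy quadratic objective, step sizes, sync interval) assumed purely for illustration:

```python
import numpy as np

# Toy sketch of low-communication "local SGD" across K sites. Nothing here is
# how any frontier lab actually trains; it only illustrates the communication
# pattern: many local steps per rare cross-site parameter sync.

rng = np.random.default_rng(0)
K, DIM, SYNC_EVERY, OUTER_ROUNDS, LR = 4, 8, 50, 20, 0.05

target = rng.normal(size=DIM)             # optimum of a toy quadratic loss
def grad(w, noise_scale=0.1):             # noisy gradient of 0.5 * ||w - target||^2
    return (w - target) + noise_scale * rng.normal(size=DIM)

global_w = np.zeros(DIM)
for _ in range(OUTER_ROUNDS):
    # Each "campus" trains independently over its fast local interconnect...
    local_ws = []
    for _ in range(K):
        w = global_w.copy()
        for _ in range(SYNC_EVERY):
            w -= LR * grad(w)
        local_ws.append(w)
    # ...and only this infrequent parameter average crosses the slow inter-site link.
    global_w = np.mean(local_ws, axis=0)

print("distance to optimum:", np.linalg.norm(global_w - target))
```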
The pressure to decentralize at this scale will also incentivize a lot more research on how to do search/planning. Otherwise you wind up with a lot of ‘stranded’ GPU/TPU capacity that is fully usable but isn’t needed for serving old models and can’t participate in training new scaled-up models. But if you switch to a search-and-distill-centric approach, suddenly all of that capacity comes online.
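To make the search-and-distill idea concrete, here is a deliberately toy sketch: a weak policy spends otherwise-idle capacity sampling many candidate answers per problem, a cheap verifier filters them, and the verified answers are distilled back so the policy answers directly next time. Every class and function here is a hypothetical stand-in for exposition, not anyone’s actual pipeline:

```python
import random

random.seed(0)

class ToyPolicy:
    """Stand-in for a model: the 'weights' are just a memo table of distilled answers."""
    def __init__(self):
        self.memorized = {}

    def sample(self, problem):
        if problem in self.memorized:        # distilled: answers in one shot
            return self.memorized[problem]
        return random.randint(0, 99)         # otherwise: a weak guess

    def distill(self, verified_pairs):
        self.memorized.update(verified_pairs)

def verify(problem, answer):
    """Cheap checker standing in for unit tests / proof checking / a reward model."""
    a, b = problem
    return answer == a + b

problems = [(random.randint(0, 40), random.randint(0, 40)) for _ in range(5)]
policy = ToyPolicy()

# Search phase: burn spare capacity on many samples per problem, keep verified ones.
verified = {}
for p in problems:
    for _ in range(500):
        answer = policy.sample(p)
        if verify(p, answer):
            verified[p] = answer
            break

# Distill phase: fold the search results back into the "model".
policy.distill(verified)
solved = sum(verify(p, policy.sample(p)) for p in problems)
print(f"solved after distillation: {solved}/{len(problems)}")
```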
I came to the comments section to say a similar thing. Right now, the easiest way for companies to push the frontier of capabilities is to throw more hardware and electricity at the problem, plus some efficiency improvements. If the cost or unavailability of electrical power or hardware became a bottleneck, that would simply tilt the equation further toward searching for more compute-efficient methods.
I believe there’s plenty of room to spend more on research there and get decent returns on the investment, so I doubt a compute bottleneck would make much of a difference. I’m pretty sure we’re already well into a compute-overhang regime, in the sense that more compute-efficient model architectures could reach current capabilities at a fraction of the compute cost.
I think the same is true for data: there’s room to spend more research investment on finding more data-efficient algorithms.