Distributed training seems close enough to being a solved problem that a project costing north of a billion dollars might get it working on schedule. It’s easier to stay within a single datacenter, and so far it hasn’t been necessary to do more than that, so the fact that distributed training isn’t yet in routine use is hardly evidence that it’s very hard to implement.
There’s also this snippet in the Gemini report:

Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. [...] we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
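To make the "model parallelism within superpods, data parallelism across superpods" layout concrete, here is a minimal sketch in Python using JAX’s sharding API. This is not Google’s training code: the 4-way model-parallel group, the array sizes, and the single-process mesh are all assumptions for illustration.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()
n_model = 4 if len(devices) % 4 == 0 else 1  # assumed model-parallel group size ("within a superpod")
n_data = len(devices) // n_model             # remaining parallelism is data parallelism ("across superpods")

# 2D device mesh: one axis for data parallelism, one for model (tensor) parallelism.
mesh = Mesh(mesh_utils.create_device_mesh((n_data, n_model)),
            axis_names=("data", "model"))

# Split the batch along the "data" axis and the weight matrix along the "model" axis.
x = jax.device_put(jnp.ones((256, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # With sharded inputs, the compiler inserts the collectives needed for the matmul;
    # in a full training step, gradients would additionally be all-reduced over the
    # "data" axis, and that is the traffic crossing the inter-cluster network.
    return jnp.tanh(x @ w)

y = layer(x, w)
print(y.shape, y.sharding)
```

The synchronous part is implicit: every replica takes the same optimizer step, and the step doesn’t complete until the reduction over the "data" axis does, which is why inter-datacenter latency and bandwidth are the quantities the report calls out.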
I think the crux for feasibility of further scaling (beyond $10-$50 billion) is whether systems at currently-reasonable cost keep getting sufficiently more useful, for example by enabling economically valuable agentic behavior: things like preparing pull requests based on the feature/bug discussion on an issue tracker, or fixing failing builds. Meaningful help with research is a crux for reaching TAI and ASI, but it doesn’t seem necessary for the existence of a $2 trillion AI company.
Thanks for the great comment!
Do we know whether distributed training is expected to scale well to GPT-6-sized models (100 trillion parameters) trained across something like 20 data centers? How does the communication cost scale with the size of the model and with the number of data centers? Linearly in both? (I sketch a rough back-of-envelope at the end of this comment.)
After reading this for 3 minutes:

Google Cloud demonstrates the world’s largest distributed training job for large language models across 50,000+ TPU v5e chips (Google, November 2023). It seems that scaling works efficiently at least up to ~50k chips (GPT-6 would be more like 2.5M chips). There is also a surprisingly linear increase in start-up time with the number of chips: 13 minutes for 32k. What is the SOTA?
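As a rough back-of-envelope for the communication-cost part of my question (every number below is an assumption, and this ignores model/pipeline parallelism and any compression):

```python
# NOT a measurement: assumes plain synchronous data parallelism across sites,
# one full model replica per site, bf16 gradients, and ring all-reduce.
params = 100e12          # hypothetical "GPT-6": 100T parameters
bytes_per_grad = 2       # bf16
n_sites = 20             # data centers, one replica each (assumption)
step_time_s = 30         # assumed wall-clock time per optimizer step

# Ring all-reduce: each participant sends (and receives) about 2*(n-1)/n of the
# gradient volume per step, so per-site traffic grows linearly with model size
# but is nearly flat in the number of participants.
per_site_bytes = 2 * (n_sites - 1) / n_sites * params * bytes_per_grad
required_tbps = per_site_bytes * 8 / step_time_s / 1e12

print(f"cross-site traffic per site per step: {per_site_bytes / 1e12:.0f} TB")
print(f"sustained bandwidth needed per site:  {required_tbps:.0f} Tbit/s")
```

Under these assumptions the cross-site traffic per step scales linearly with parameter count and only weakly with the number of data centers; the binding constraints are per-site bandwidth and the latency of the synchronous step, which is presumably why practical setups keep model parallelism within a site and reserve the cross-site links for data parallelism, as in the Gemini quote above.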