I’ve been looking at the numbers regarding how many GPUs it would take to train a model with as many parameters as the human brain has synapses. The human brain has roughly 100 trillion synapses, and they are sparse and very efficiently connected. A typical AI model densely connects every neuron in a layer to every neuron in the previous layer, which is less efficient.
An H100 has 80 GB of VRAM, so assuming that each parameter is 32 bits (4 bytes), you can fit about 20 billion parameters per GPU. So, you’d need roughly 5,000 GPUs just to fit a single instance of a human-brain-sized model in memory, maybe. If you assume inefficiencies and needing activations and data in memory as well, you could ballpark another order of magnitude, so something like 50,000–100,000 might be needed.
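In code, the ballpark works out like this (all of the inputs are just the rough assumptions above):

```python
# Quick sketch of the ballpark above; every figure is a rough assumption.
synapses = 100e12          # ~100 trillion synapses, one parameter per synapse
bytes_per_param = 4        # 32-bit parameters
gpu_mem_bytes = 80e9       # H100: 80 GB of VRAM

params_per_gpu = gpu_mem_bytes / bytes_per_param      # ~20 billion
gpus_for_weights = synapses / params_per_gpu          # ~5,000
gpus_with_overhead = gpus_for_weights * 10            # extra order of magnitude for overhead

print(f"{params_per_gpu:.0e} params/GPU, {gpus_for_weights:,.0f} GPUs for the weights, "
      f"~{gpus_with_overhead:,.0f} with overhead")
```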
For comparison, it’s widely believed that OpenAI trained GPT-4 on about 10,000 A100s that Microsoft let them use from their Azure supercomputer, most likely the one listed as third most powerful in the world on the Top500 list.
Recently though, Microsoft and Meta have both moved to acquire more GPUs that put them in the 100,000 range, and Elon Musk’s X.ai managed to get a 100,000 H100 GPU supercomputer online in Memphis.
So, in theory at least, we are nearly at the point where they could train a human-brain-sized model in terms of memory. However, keep in mind that training such a model would take a ton of compute time. I haven’t done the calculations for FLOPS yet, so I don’t know if it’s feasible. Just some quick back-of-the-envelope analysis.
it’s widely believed that OpenAI trained GPT-4 on about 10,000 A100s
What I can find is 20,000 A100s. With 10K A100s, which do about 300e12 FLOP/s in BF16, you’d need about 6 months at 40% utilization to get the rumored 2e25 FLOPs, so this is still plausible. We know Llama-3-405B is about 4e25 FLOPs and approximately as smart, and it’s dense, so you can get away with fewer FLOPs in a MoE model to get similar capabilities, which supports the 2e25 FLOPs figure, given the premise that the original GPT-4 is a MoE.
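Spelled out, the arithmetic for that sanity check (same assumed figures as above):

```python
# Sanity check of the rumored 2e25 FLOPs, using the figures assumed above.
a100_bf16_flops_per_s = 300e12    # ~300 TFLOP/s peak BF16 per A100
n_gpus = 10_000
utilization = 0.40                # assumed average utilization
seconds = 6 * 30 * 24 * 3600      # ~6 months

total_flops = a100_bf16_flops_per_s * n_gpus * utilization * seconds
print(f"{total_flops:.1e} FLOPs")  # ~1.9e25, consistent with the rumored 2e25
```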
An H100 has 80 GB of VRAM
H200s are 140 GB, and there are now MI300Xs with 192 GB. B200s will also have 192 GB.
assuming that each parameter is 32 bits
Training is typically done in BF16, though you need space for gradients in addition to parameters, and for optimizer states (which ZeRO can shard across GPUs). On the other hand, inference with 8-bit quantization is essentially indistinguishable from full precision.
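As a rough accounting (assuming a standard mixed-precision Adam recipe; the exact bytes depend on the setup):

```python
# Approximate bytes per parameter for mixed-precision training with Adam,
# the common ~16 bytes/param accounting; assumptions, not a universal rule.
bf16_weights = 2
bf16_grads   = 2
fp32_master  = 4   # FP32 master copy of the weights
fp32_adam_m  = 4   # Adam first moment
fp32_adam_v  = 4   # Adam second moment
train_bytes_per_param = bf16_weights + bf16_grads + fp32_master + fp32_adam_m + fp32_adam_v

int8_inference_bytes_per_param = 1   # 8-bit quantized inference

print(train_bytes_per_param, int8_inference_bytes_per_param)   # 16 vs 1
```

ZeRO shards the optimizer (and optionally gradient) bytes across GPUs rather than eliminating them, so the aggregate memory requirement across the cluster stays roughly the same.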
Recently though, Microsoft and Meta have both moved to acquire more GPUs that put them in the 100,000 range
The word is, next year it’s 500K B200s[1] for Microsoft. And something in the gigawatt range from Google as well.
[1] He says 500K GB200s, but also that it’s 1 gigawatt all told, and that they are 2-3x faster than H100s, so I believe he means 500K B200s. In various places, “GB200” seems to ambiguously refer either to a 2-GPU board with a Grace CPU, or to one of the B200s on such a board.
Thanks for the clarifications. My naive estimate is obviously just a simplistic ballpark figure based on some rough approximations, so I appreciate the added precision.
Also, even if we can train and run a model the size of the human brain, it would still be many orders of magnitude less energy efficient than an actual brain. Human brains use barely 20 watts. This hypothetical GPU brain would require an enormous data centre’s worth of power, and each H100 GPU uses 700 watts alone.
Also, even if we can train and run a model the size of the human brain, it would still be many orders of magnitude less energy efficient than an actual brain. Human brains use barely 20 watts.
For inference on a GPT-4-level model, GPUs use much less power than a human brain, about 1-2 watts (across all the necessary GPUs), if we imagine slowing them down to human speed and splitting the power among the LLM instances being processed at the same time. Even for a 30-trillion-parameter model, it might only come to 30-60 watts in this sense.
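To make the shape of that estimate concrete, here is a toy version; every number in it is made up purely for illustration:

```python
# Toy illustration of the "power per human-speed instance" estimate.
# All of these numbers are made-up assumptions, just to show the shape of the argument.
node_power_w       = 8 * 1400    # 8 H100s including datacenter overhead (see the note below)
concurrent_seqs    = 256         # sequences served in one large batch (assumed)
tokens_per_s_each  = 30          # generation speed per sequence at that batch size (assumed)
human_tokens_per_s = 3           # rough "human speed" (assumed)

power_per_seq = node_power_w / concurrent_seqs
# A sequence slowed to human speed only needs a fraction of that GPU time:
power_at_human_speed = power_per_seq * human_tokens_per_s / tokens_per_s_each
print(f"~{power_at_human_speed:.1f} W per human-speed instance")   # a few watts
```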
each H100 GPU uses 700 watts alone
You should count the rest of the datacenter as well, which gets it up to about 1200-1400 watts per H100, and about 2000 watts for B200s in GB200 systems. (It’s hilarious how some model training papers do their CO2 emission estimates using 700 watts. They feel obliged to make the calculations, but then cheat like there’s no tomorrow.)
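A sketch of where a number like that comes from (the overhead fractions are assumptions, not measurements):

```python
# Back-of-the-envelope: whole-datacenter power attributable to a single H100.
gpu_board_w   = 700    # H100 SXM board power
host_overhead = 0.4    # CPUs, RAM, NICs, fans, storage amortized per GPU (assumed)
pue           = 1.2    # cooling and power-delivery overhead (assumed)

total_w = gpu_board_w * (1 + host_overhead) * pue
print(f"~{total_w:.0f} W per H100 at the datacenter level")   # ~1200 W with these assumptions
```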
I was not aware of these. Thanks!