What about Fugaku, currently the fastest supercomputer, with roughly 1 exaFLOPS in single or further reduced precision? What would be the cost of training a 100 trillion parameter model with it?
https://www.top500.org/news/japan-captures-top500-crown-arm-powered-supercomputer/
The Scaling Laws for Neural Language Models paper says that the optimal model size scales roughly 5x with every 10x increase in compute. So, to put rough numbers on it using GPT-3 (about 4,000 PetaFLOP/s-days of training compute for roughly 200 billion parameters), a 100 trillion parameter model, about 500x larger, would require on the order of 4,000 ExaFLOP/s-days, and likely more, since under that rule compute has to grow faster than model size. (This assumes the GPT-3 architecture, so no sparse or linear transformer improvements.) To be fair, the Scaling Laws paper also predicts that the scaling laws break down around 1 trillion parameters.
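To make that extrapolation explicit, here is a quick back-of-the-envelope sketch (my own, not from the paper): it assumes the 5x-per-10x rule can be written as N_opt ∝ C^0.7, takes the rough GPT-3 numbers above as the baseline, and extrapolates well past the ~1 trillion parameter point where the paper expects the law to break down.

```python
import math

# "Optimal model size scales ~5x with 10x more compute"
# => N_opt ∝ C**alpha with alpha = log(5)/log(10) ≈ 0.70
ALPHA = math.log(5) / math.log(10)

# Assumed GPT-3 baseline (rough figures quoted above).
GPT3_PARAMS = 200e9     # ~200 billion parameters
GPT3_PF_DAYS = 4_000    # ~4,000 PetaFLOP/s-days of training compute

def optimal_params(compute_pf_days: float) -> float:
    """Compute-optimal model size implied by the 5x-per-10x rule."""
    return GPT3_PARAMS * (compute_pf_days / GPT3_PF_DAYS) ** ALPHA

def required_compute(params: float) -> float:
    """Invert the rule: PetaFLOP/s-days needed to optimally train `params` parameters."""
    return GPT3_PF_DAYS * (params / GPT3_PARAMS) ** (1.0 / ALPHA)

# 1,000x the GPT-3 budget (i.e. 4,000 ExaFLOP/s-days) only buys a ~25T model...
print(f"{optimal_params(1_000 * GPT3_PF_DAYS) / 1e12:.0f}T params")
# ...so 100T parameters needs several thousand times the GPT-3 budget (~29,000 ExaFLOP/s-days).
print(f"{required_compute(100e12) / 1_000:,.0f} ExaFLOP/s-days")
```

So 4,000 ExaFLOP/s-days is really the optimistic end of the range.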
The peak FP16 performance of Fugaku seems to be about 2 exaFLOPS. If we are generous and assume 30% sustained utilization of peak when training a transformer model, about the same efficiency as an optimized large GPU cluster, that gives 0.6 exaFLOP/s, so 4,000 ExaFLOP/s-days would take roughly 6,700 days, close to two decades, and that is with the optimistic end of the compute estimate.
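The time estimate is just compute divided by sustained throughput; a minimal sketch, using only the rough figures already quoted above:

```python
# Rough training-time estimate (all figures approximate, taken from the text above).
PEAK_FP16_EXAFLOPS = 2.0     # Fugaku's approximate peak FP16 throughput, ExaFLOP/s
UTILIZATION = 0.30           # generous sustained fraction of peak for transformer training
COMPUTE_EF_DAYS = 4_000      # optimistic-end training compute, ExaFLOP/s-days

sustained_exaflops = PEAK_FP16_EXAFLOPS * UTILIZATION      # 0.6 ExaFLOP/s
days = COMPUTE_EF_DAYS / sustained_exaflops                # ~6,667 days
print(f"~{days:,.0f} days, i.e. ~{days / 365:.0f} years")  # ~6,667 days, ~18 years
```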
Fugaku seems to have cost around $1B, which leads me to believe that GPUs deliver considerably more FP16 FLOPS per dollar than the Arm SVE-based processors Fugaku uses. In any case, even with GPUs, it is clear we are some years away unless we find a more efficient neural language model architecture.
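A very crude FP16-throughput-per-dollar comparison makes the point (sketch only: the choice of an NVIDIA A100 as the GPU reference and its roughly $15k price are my assumptions, real prices vary a lot, and this ignores interconnect, memory, power and the cost of actually building a cluster):

```python
# Crude FP16 FLOP/s-per-dollar comparison (very rough, prices assumed).
FUGAKU_FP16_FLOPS = 2e18    # ~2 ExaFLOP/s peak FP16
FUGAKU_COST_USD = 1e9       # ~1 B$ reported system cost

A100_FP16_FLOPS = 312e12    # NVIDIA A100 dense FP16 tensor-core peak
A100_COST_USD = 15_000      # assumed rough per-GPU price (varies widely)

fugaku_per_dollar = FUGAKU_FP16_FLOPS / FUGAKU_COST_USD   # ~2e9 FLOP/s per $
a100_per_dollar = A100_FP16_FLOPS / A100_COST_USD         # ~2e10 FLOP/s per $
print(f"GPU advantage: ~{a100_per_dollar / fugaku_per_dollar:.0f}x FP16 FLOP/s per dollar")
```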