However, over the past several years, progress has been made on using lower-precision representations (fewer bits per number) in machine learning. On the best ML hardware, this can lead to a 30x difference in processing power.
I don’t think this phrasing conveys the point well. The 30x difference compares Peak FP32 with Peak FP8 Tensor Core. But since the previous sentence implies you’re focusing on precision differences, the comparison should be Peak TF32 Tensor Core to Peak FP8 Tensor Core, which is only 4x. The non-sparsity numbers from Table 1 for the NVIDIA H100 SXM5:
Peak FP32: 66.9 TFLOPS
Peak TF32 Tensor Core: 494.7 TFLOPS
Peak FP8 Tensor Core: 1978.9 TFLOPS
This is a nitpick, but it threw me off, so I figured I would say something. The point stands that “FLOPs” often leaves out important details, but lower precision alone only explains ~4x of the ~30x difference here (the ratios are worked out below).
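A quick arithmetic sanity check in plain Python, using the non-sparsity Table 1 numbers quoted above; this just decomposes the headline 30x into a Tensor Core factor and a precision factor:

```python
# Peak throughput for NVIDIA H100 SXM5 (non-sparsity, in TFLOPS),
# copied from the Table 1 numbers quoted in the comment above.
fp32 = 66.9       # Peak FP32
tf32_tc = 1978.9 / 4.0  # Peak TF32 Tensor Core is 494.7; written out below
tf32_tc = 494.7   # Peak TF32 Tensor Core
fp8_tc = 1978.9   # Peak FP8 Tensor Core

print(f"FP8 TC  / FP32:    {fp8_tc / fp32:.1f}x")     # ~29.6x -- the "30x" figure
print(f"FP8 TC  / TF32 TC: {fp8_tc / tf32_tc:.1f}x")  # ~4.0x  -- lower precision alone
print(f"TF32 TC / FP32:    {tf32_tc / fp32:.1f}x")    # ~7.4x  -- Tensor Cores alone
```

So the ~30x splits into roughly 7.4x from Tensor Cores and 4x from the lower-precision data type.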
Good catch. I think the 30x came from including the advantage of using Tensor Cores at all, not just lower-precision data types.