However, over the past several years, progress has been made on using lower-precision representations (fewer bits per number) in machine learning. On the best ML hardware, this can lead to a 30x difference in processing power.
I don’t think this phrasing conveys the point well. The 30x difference compares Peak FP32 with Peak FP8 Tensor Core. But since the previous sentence implies you’re focusing on precision differences, the comparison should be Peak TF32 Tensor Core to Peak FP8 Tensor Core, which is only 4x. The non-sparsity numbers from Table 1 for the NVIDIA H100 SXM5:
Peak FP32: 66.9 TFLOPS
Peak TF32 Tensor Core: 494.7 TFLOPS
Peak FP8 Tensor Core: 1978.9 TFLOPS
This is a nitpick, but it threw me off, so I figured I would say something. The point stands that “FLOPs” often leaves out important details, but lower precision alone only explains ~4x of the ~30x difference here (the ratios are worked out below).
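A quick arithmetic sanity check in plain Python, using the non-sparsity Table 1 numbers quoted above; this just decomposes the headline 30x into a Tensor Core factor and a precision factor:

```python
# Peak throughput for NVIDIA H100 SXM5 (non-sparsity, in TFLOPS),
# copied from the Table 1 numbers quoted in the comment above.
fp32 = 66.9       # Peak FP32
tf32_tc = 1978.9 / 4.0  # Peak TF32 Tensor Core is 494.7; written out below
tf32_tc = 494.7   # Peak TF32 Tensor Core
fp8_tc = 1978.9   # Peak FP8 Tensor Core

print(f"FP8 TC  / FP32:    {fp8_tc / fp32:.1f}x")     # ~29.6x -- the "30x" figure
print(f"FP8 TC  / TF32 TC: {fp8_tc / tf32_tc:.1f}x")  # ~4.0x  -- lower precision alone
print(f"TF32 TC / FP32:    {tf32_tc / fp32:.1f}x")    # ~7.4x  -- Tensor Cores alone
```

So the ~30x splits into roughly 7.4x from Tensor Cores and 4x from the lower-precision data type.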
Good catch. I think the 30x came from including the advantage of using Tensor Cores at all, not just lower-precision data types.