There are many different types of “TFLOPS” that are not directly comparable, independent of precision. The TPU v5e does not have anything remotely close to 393 TFLOPS of general-purpose ALU performance. The number you are quoting is the peak throughput of its dedicated matmul units (the MXUs), which are most comparable to nvidia tensorcores, but worse, as they are less flexible (much larger minimum block sizes).
The RTX 4090 has ~82 TFLOPS of general-purpose SIMD 32/16-bit flops, considerably more than the 51 (PCIe) or 67 (SXM) TFLOPS of even the H100. I'm not sure what the general ALU flops of the TPU are, but it's almost certainly much less than the H100's, and therefore less than the 4090's.
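(For reference, that ~82 figure is just the standard peak-flops arithmetic: shader count × clock × 2 ops per FMA. A minimal sanity check in Python, using the 4090's published specs:)

```python
# Peak general-purpose FP32 throughput = shader count x boost clock x 2
# (an FMA counts as two floating-point ops per cycle).
cuda_cores = 16384        # RTX 4090 shader (CUDA core) count
boost_clock_hz = 2.52e9   # advertised boost clock
peak_flops = cuda_cores * boost_clock_hz * 2
print(f"{peak_flops / 1e12:.1f} TFLOPS")  # -> 82.6 TFLOPS
```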
The 4090's theoretical tensorcore perf is 330/661 TFLOPS for fp16 dense/sparse[1] and 661/1321[2][3] for fp8 dense/sparse (sparse using nvidia's 2:1 local block sparsity encoding), plus 661 int8 TOPS (which isn't as useful as fp8, of course). You seem to be using the sparse 2:1 fp8 tensorcore, or possibly even the 4-bit pathway, perf for the H100, so that is the most comparable figure. And if you are going to use INT8 precision for the TPU (393 TOPS), well, the 4090 has about 1.7x that with 661 dense 8-bit integer TOPS, for about 1/4 the price. The 4090 has about an OOM lead in low-precision flops/$ (in theory).
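To make the flops/$ claim concrete, here's a minimal back-of-envelope sketch in Python. The throughput numbers are the theoretical peaks quoted above; the prices are assumptions: $1599 is the 4090's launch MSRP, and the TPU v5e figure is a placeholder implied by the ~4x price ratio above, since the v5e is cloud-only and has no retail price.

```python
# Back-of-envelope int8 TOPS-per-dollar, using theoretical peak throughput.
# Prices are assumptions: 4090 launch MSRP, and a v5e placeholder at ~4x
# that (the chip isn't sold at retail, so this is illustrative only).
chips = {
    #  name:       (dense int8 TOPS, assumed price in USD)
    "RTX 4090": (661, 1599),
    "TPU v5e":  (393, 4 * 1599),
}

for name, (tops, price) in chips.items():
    print(f"{name}: {tops} TOPS / ${price} = {tops / price:.3f} TOPS per $")
# -> ~0.41 vs ~0.06 TOPS/$, i.e. a ~7x gap under these assumed prices
```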
Of course, what actually matters is practical real-world benchmark performance, due to the complex interactions between RAM and cache capacity and the various types of bandwidth (on-chip across the cache hierarchy, off-chip to RAM, between chips, etc.), and nvidia dominates most real-world benchmarks.
[1] Wikipedia
[2] Tom's Hardware
[3] NVIDIA Ada GPU Architecture whitepaper