Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication? What does this have to do with the RAM being next to the processing units?
(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world expert in GPU design! E.g. PIM, CMOS, MAC, etc.)
Matrix-matrix multiplication of square matrices of dimension N uses ~2N³ ALU ops and ~3N² MEM ops, so it has an arithmetic intensity (ALU:MEM ratio) of ~N.
Vector-matrix multiplication of dimension N uses ~2N² ALU ops and ~3N² MEM ops, for an arithmetic intensity of ~1.
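To make these counts concrete, here is a minimal sketch (my own illustration, reusing the op counts quoted above; the function names are mine) that prints the arithmetic intensity of both shapes:

```python
# Arithmetic intensity = ALU ops / MEM ops, using the counts quoted above.

def matmat_intensity(n):
    alu = 2 * n**3   # N multiplies + N adds for each of the N^2 output entries
    mem = 3 * n**2   # load two NxN inputs, store one NxN output
    return alu / mem

def vecmat_intensity(n):
    alu = 2 * n**2   # N multiplies + N adds for each of the N output entries
    mem = 3 * n**2   # ~3N^2, following the count in the parent comment
    return alu / mem

for n in (64, 1024, 16384):
    print(f"N={n:5d}: mat-mat ~{matmat_intensity(n):7.0f}, "
          f"vec-mat ~{vecmat_intensity(n):.2f}")
# Intensity grows like ~N for matrix-matrix but stays ~1 for vector-matrix.
```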
A GPU has an ALU:MEM ratio of about 1000:1 (for lower-precision tensor core ALUs), so it is inefficient at vector-matrix multiplication by a factor of about 1000 compared to matrix-matrix multiplication. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units inside a tensor core vs. very long wire distances to reach a value in off-chip RAM.
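As a rough illustration of what that ratio implies (a toy roofline-style estimate with my own function names, not any particular GPU's specs): a kernel whose arithmetic intensity is below the hardware's ALU:MEM ratio is memory-bound, and can use at most roughly intensity/ratio of peak ALU throughput.

```python
# Toy roofline estimate; the 1000:1 ratio is the figure quoted above,
# the rest is arithmetic.

HW_RATIO = 1000  # ALU ops per MEM op (low-precision tensor core regime)

def peak_alu_utilization(intensity):
    # Below the hardware ratio, memory is the bottleneck and the ALUs
    # idle in proportion.
    return min(1.0, intensity / HW_RATIO)

print(peak_alu_utilization(2 / 3))         # vector-matrix: ~0.07% of peak
print(peak_alu_utilization(2 * 4096 / 3))  # matrix-matrix at N=4096: compute-bound
```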
What are ALU and MEM, exactly? And what is the significance of the ALU:MEM ratio?
The GPU needs numbers to be stored in registers inside the GPU before it can operate on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you perform an elementary arithmetic operation, such as addition or multiplication, on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor, so they are called ALU ops.
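To make the distinction concrete, here is a toy example (Python variables standing in for registers; an illustration, not how GPU code is actually written):

```python
# Each element-wise multiply costs three MEM ops for one ALU op.

def elementwise_product(a, b, out):
    for i in range(len(a)):
        x = a[i]      # MEM op: load a[i] from memory into a register
        y = b[i]      # MEM op: load b[i]
        z = x * y     # ALU op: multiply two values already in registers
        out[i] = z    # MEM op: store the result back to memory

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
out = [0.0] * 3
elementwise_product(a, b, out)
print(out)  # [4.0, 10.0, 18.0]
```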
Because a matrix multiplication of two N×N matrices only involves 2N² distinct floating point numbers as input, and writing the result back into memory is going to cost you another N² memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N². In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with 2N⋅N² = 2N³ ALU ops needed.
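Here is a sketch of that naive algorithm with an ALU-op counter, just to check the 2N³ figure (my own illustration; a real GPU schedules the work very differently, but the counts are the same):

```python
# Naive triple-loop N x N matmul, counting ALU ops as it goes.

def naive_matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    alu = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]  # one multiply + one add
                alu += 2
            C[i][j] = acc
    return C, alu

n = 16
I = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
C, alu = naive_matmul(I, I)
assert alu == 2 * n**3  # the 2N^3 ALU count derived above
# Distinct values moved: two NxN inputs + one NxN output = 3N^2 MEM ops.
```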
The ALU:MEM ratio is important because if your computation is imbalanced relative to what your hardware supports, you’ll end up bottlenecked by one of the two and unable to exploit the surplus capacity you have on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio, then whenever you use the hardware to do matrix multiplications, you’ll have enormous amounts of MEM op capacity sitting idle, because the ALUs can’t consume operands fast enough to keep the memory system busy.
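As a toy model of that imbalance (made-up throughput numbers, my own helper function): runtime is set by whichever resource saturates first, so on a 1:1 machine a matrix multiplication leaves the memory side almost entirely idle.

```python
# Bottleneck model: runtime = max over resources of (ops / rate).

def utilizations(alu_ops, mem_ops, alu_rate, mem_rate):
    t = max(alu_ops / alu_rate, mem_ops / mem_rate)  # runtime
    return (alu_ops / alu_rate) / t, (mem_ops / mem_rate) / t

n = 4096
alu_ops, mem_ops = 2 * n**3, 3 * n**2  # matrix-matrix counts from above

# The "bizarre" 1:1 GPU: ALUs saturated, MEM capacity almost entirely idle.
alu_util, mem_util = utilizations(alu_ops, mem_ops, 1e12, 1e12)
print(f"ALU {alu_util:.0%} busy, MEM {mem_util:.3%} busy")  # ~100% vs ~0.04%
```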
This is helpful, thanks a ton Ege!