Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication? What does this have to do with the RAM being next to the processing units?
(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world expert in GPU design! E.g. PIM, CMOS, MAC, etc.)
Matrix-matrix multiplication of square matrices of dimension N uses ~2N³ ALU ops and ~3N² MEM ops, so it has an arithmetic intensity (ALU:MEM ratio) of ~N.
Vector-matrix multiplication of dimension N uses ~2N² ALU ops and ~3N² MEM ops, for an arithmetic intensity of ~1.
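To make these counts concrete, here is a minimal sketch (my own illustration, reusing the op counts quoted above; the function names are mine) that prints the arithmetic intensity of both shapes:

```python
# Arithmetic intensity = ALU ops / MEM ops, using the counts quoted above.

def matmat_intensity(n):
    alu = 2 * n**3   # N multiplies + N adds for each of the N^2 output entries
    mem = 3 * n**2   # load two NxN inputs, store one NxN output
    return alu / mem

def vecmat_intensity(n):
    alu = 2 * n**2   # N multiplies + N adds for each of the N output entries
    mem = 3 * n**2   # ~3N^2, following the count in the parent comment
    return alu / mem

for n in (64, 1024, 16384):
    print(f"N={n:5d}: mat-mat ~{matmat_intensity(n):7.0f}, "
          f"vec-mat ~{vecmat_intensity(n):.2f}")
# Intensity grows like ~N for matrix-matrix but stays ~1 for vector-matrix.
```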
A GPU has an ALU:MEM ratio of about 1000:1 (for lower-precision tensor core ALUs), so it is inefficient at vector-matrix multiplication by a factor of about 1000 compared to matrix-matrix multiplication. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units inside a tensor core vs. very long wire distances to reach a value in off-chip RAM.
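As a rough illustration of what that ratio implies (a toy roofline-style estimate with my own function names, not any particular GPU's specs): a kernel whose arithmetic intensity is below the hardware's ALU:MEM ratio is memory-bound, and can use at most roughly intensity/ratio of peak ALU throughput.

```python
# Toy roofline estimate; the 1000:1 ratio is the figure quoted above,
# the rest is arithmetic.

HW_RATIO = 1000  # ALU ops per MEM op (low-precision tensor core regime)

def peak_alu_utilization(intensity):
    # Below the hardware ratio, memory is the bottleneck and the ALUs
    # idle in proportion.
    return min(1.0, intensity / HW_RATIO)

print(peak_alu_utilization(2 / 3))         # vector-matrix: ~0.07% of peak
print(peak_alu_utilization(2 * 4096 / 3))  # matrix-matrix at N=4096: compute-bound
```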
What are ALU and MEM, exactly? And what is the significance of the ALU:MEM ratio?
The GPU needs numbers to be stored in registers inside the GPU before it can operate on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you perform an elementary arithmetic operation, such as addition or multiplication, on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor, so they are called ALU ops.
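To make the distinction concrete, here is a toy example (Python variables standing in for registers; an illustration, not how GPU code is actually written):

```python
# Each element-wise multiply costs three MEM ops for one ALU op.

def elementwise_product(a, b, out):
    for i in range(len(a)):
        x = a[i]      # MEM op: load a[i] from memory into a register
        y = b[i]      # MEM op: load b[i]
        z = x * y     # ALU op: multiply two values already in registers
        out[i] = z    # MEM op: store the result back to memory

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
out = [0.0] * 3
elementwise_product(a, b, out)
print(out)  # [4.0, 10.0, 18.0]
```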
Because a matrix multiplication of two N×N matrices only involves 2N² distinct floating point numbers as input, and writing the result back into memory is going to cost you another N² memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N². In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with 2N⋅N² = 2N³ ALU ops needed.
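Here is a sketch of that naive algorithm with an ALU-op counter, just to check the 2N³ figure (my own illustration; a real GPU schedules the work very differently, but the counts are the same):

```python
# Naive triple-loop N x N matmul, counting ALU ops as it goes.

def naive_matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    alu = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]  # one multiply + one add
                alu += 2
            C[i][j] = acc
    return C, alu

n = 16
I = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
C, alu = naive_matmul(I, I)
assert alu == 2 * n**3  # the 2N^3 ALU count derived above
# Distinct values moved: two NxN inputs + one NxN output = 3N^2 MEM ops.
```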
The ALU:MEM ratio is important because if your computation is imbalanced relative to what your hardware supports, you’ll end up bottlenecked by one of the two and unable to exploit the surplus capacity you have on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio, then whenever you use the hardware to do matrix multiplications, you’ll have enormous amounts of MEM op capacity sitting idle, because the ALUs can’t consume operands fast enough to keep the memory system busy.
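As a toy model of that imbalance (made-up throughput numbers, my own helper function): runtime is set by whichever resource saturates first, so on a 1:1 machine a matrix multiplication leaves the memory side almost entirely idle.

```python
# Bottleneck model: runtime = max over resources of (ops / rate).

def utilizations(alu_ops, mem_ops, alu_rate, mem_rate):
    t = max(alu_ops / alu_rate, mem_ops / mem_rate)  # runtime
    return (alu_ops / alu_rate) / t, (mem_ops / mem_rate) / t

n = 4096
alu_ops, mem_ops = 2 * n**3, 3 * n**2  # matrix-matrix counts from above

# The "bizarre" 1:1 GPU: ALUs saturated, MEM capacity almost entirely idle.
alu_util, mem_util = utilizations(alu_ops, mem_ops, 1e12, 1e12)
print(f"ALU {alu_util:.0%} busy, MEM {mem_util:.3%} busy")  # ~100% vs ~0.04%
```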
This is helpful, thanks a ton Ege!