Further item of “these elaborate calculations seem to arrive at conclusions that can’t possibly be true”—besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped, the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably GPUs are near the Limit of the Possible themselves.
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of computing them with digital transistor operations, even if you otherwise use actual real-world physical hardware. Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency! https://spectrum.ieee.org/analog-ai
I’m not sure why you believe “the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible”. GPUs require at least on the order of ~1e-11 J to fetch a single 8-bit value from GDDRX RAM (1e-19 J/bit/nm of interconnect wire energy * ~1 cm * 8 bits), so around ~1 kW, or 100x the brain, for 1e14 of those fetches per second, not even including flop energy cost. (The brain doesn’t have much more efficient wires; it just minimizes that entire cost by moving the memory, the synapses/weights, as close as possible to the compute ... by merging them.) I do claim that Moore’s Law is ending and will not deliver much further increase in CMOS energy efficiency (and essentially zero increase in wire energy efficiency), but GPUs are far from the optimal use of CMOS for running NNs.
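As a quick back-of-envelope check of that arithmetic (a sketch only: the wire energy, wire length, bits per fetch, and fetch rate are the figures quoted above; the ~10 W brain power budget is an assumed reference value rather than a number from this thread):

```python
# Back-of-envelope check of the off-chip fetch energy figures quoted above.
wire_energy_j_per_bit_nm = 1e-19   # interconnect wire energy (figure from the comment)
wire_length_nm = 1e7               # ~1 cm of wire to off-chip GDDR RAM, in nm
bits_per_fetch = 8                 # one 8-bit value

energy_per_fetch_j = wire_energy_j_per_bit_nm * wire_length_nm * bits_per_fetch
print(f"energy per 8-bit fetch: {energy_per_fetch_j:.0e} J")   # ~8e-12 J, i.e. order 1e-11 J

fetch_rate_per_s = 1e14
power_w = energy_per_fetch_j * fetch_rate_per_s
print(f"power at 1e14 fetches/s: {power_w:.0f} W")             # ~800 W, i.e. order ~1 kW

brain_power_w = 10                 # assumed ~10 W brain compute budget (not from this thread)
print(f"roughly {power_w / brain_power_w:.0f}x the assumed brain power budget")
```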
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate,
That sounds about right, and indeed I roughly estimate the minimal energy for an 8-bit analog MAC at the end of the synapse section, with reference examples from the research lit:
We can also compare the minimal energy prediction of 10^-15 J/op for 8-bit equivalent analog multiply-add to the known and predicted values for upcoming efficient analog accelerators, which mostly have energy efficiency in the 10^-14 J/op range[1][2][3][4] for < 8-bit precision, with the higher reported values around 10^-15 J/op similar to the brain estimate here, but only for < 4-bit precision[5]. Analog devices cannot be shrunk down to few-nm sizes without sacrificing SNR and precision; their minimal size is determined by the need for a large number of carriers, on the order of 2^(c·β) for equivalent bit precision β with c ~ 2, as discussed earlier.
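To make the scaling in that passage concrete, here is a tiny worked sketch; the only inputs are the c ~ 2, β = 8, and 10^-15 / 10^-14 J/op figures quoted above.

```python
# Tiny worked instance of the carrier-count scaling in the quoted passage.
c = 2          # scaling constant from the quote (c ~ 2)
beta = 8       # equivalent bit precision

carriers = 2 ** (c * beta)        # carriers needed scales as ~2^(c*beta)
print(f"carriers needed for {beta}-bit-equivalent analog signaling: ~{carriers:,}")  # ~65,536

predicted_min_j = 1e-15   # predicted minimum for an 8-bit analog multiply-add (from the quote)
reported_j = 1e-14        # typical reported value for <8-bit analog accelerators (from the quote)
print(f"reported accelerators sit ~{reported_j / predicted_min_j:.0f}x above the predicted minimum")
```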
The more complicated part of comparing these is how/whether to include the cost of reading/writing a synapse/weight value from RAM across a long wire, which is required for full equivalence to the brain. The brain as a true RNN is doing Vector Matrix multiplication, whereas GPUs/Accelerators instead do Matrix Matrix multiplication to amortize the cost of expensive RAM fetches. VM mult can simulate MM mult at no extra cost, but MM mult can only simulate VM mult at huge inefficiency proportional to the minimal matrix size (determined by ALU/RAM ratio, ~1000:1 now at low precision). The full neuromorphic or PIM approach instead moves the RAM next to the processing elements, and is naturally more suited to VM mult.
Bavandpour, Mohammad, et al. “Mixed-Signal Neuromorphic Processors: Quo Vadis?” 2019 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2019.
Chen, Jia, et al. “Multiply accumulate operations in memristor crossbar arrays for analog computing.” Journal of Semiconductors 42.1 (2021): 013104.
Li, Huihan, et al. “Memristive crossbar arrays for storage and computing applications.” Advanced Intelligent Systems 3.9 (2021): 2100017.
Li, Can, et al. “Analogue signal and image processing with large memristor crossbars.” Nature Electronics 1.1 (2018): 52-59.
Mahmoodi, M. Reza, and Dmitri Strukov. “Breaking POps/J barrier with analog multiplier circuits based on nonvolatile memories.” Proceedings of the International Symposium on Low Power Electronics and Design. 2018.
Okay, if you’re not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.
Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication? What does this have to do with the RAM being next to the processing units?

(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world expert in GPU design! E.g. PIM, CMOS, MAC, etc.)
Matrix Matrix Mult of square matrices dim N uses ~2N^3 ALU ops and ~3N^2 MEM ops, so it has an arithmetic intensity of ~N (ALU:MEM ratio).

Vector Matrix Mult of dim N uses ~2N^2 ALU and ~3N^2 MEM, for an arithmetic intensity of ~1.
A GPU has an ALU:MEM ratio of about 1000:1 (for lower precision tensorcore ALU), so it is inefficient at vector matrix mult by a factor of about 1000 vs matrix matrix mult. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units in a tensorcore vs very long wire distances to reach a value in off chip RAM.
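A small sketch to make the arithmetic-intensity comparison concrete (the op counts and the ~1000:1 ratio are the rough figures above, not the specs of any particular GPU):

```python
# Arithmetic intensity (ALU ops per MEM op) for matrix-matrix vs vector-matrix
# multiply, using the rough op counts from the comment above.
def matmat_intensity(n: int) -> float:
    alu = 2 * n**3          # ~2N^3 multiply-adds
    mem = 3 * n**2          # read two NxN matrices, write one
    return alu / mem        # ~2N/3, i.e. grows with N

def vecmat_intensity(n: int) -> float:
    alu = 2 * n**2          # ~2N^2 multiply-adds
    mem = 3 * n**2          # ~N^2-scale weight traffic dominates (count from above)
    return alu / mem        # ~ constant, order 1

gpu_alu_mem_ratio = 1000    # rough low-precision tensorcore figure from the comment
n = 4096
print(f"MM intensity at N={n}: ~{matmat_intensity(n):.0f} ALU ops per MEM op")
print(f"VM intensity at N={n}: ~{vecmat_intensity(n):.1f} ALU ops per MEM op")
# With intensity ~1 but hardware balanced for ~1000, vector-matrix work leaves
# the ALUs almost entirely idle: the RAM fetches are the bottleneck.
print(f"ALU utilization for VM on a {gpu_alu_mem_ratio}:1 device: "
      f"~{vecmat_intensity(n) / gpu_alu_mem_ratio:.1%}")
```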
What are ALU and MEM exactly? And what is the significance of the ALU:MEM ratio?
The GPU needs numbers to be stored in registers inside the GPU before it can do operations on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you do an elementary arithmetic operation such as addition or multiplication on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor so are called ALU ops.
Because a matrix multiplication of two N×N matrices only involves 2N^2 distinct floating point numbers as input, and writing the result back into memory is going to cost you another N^2 memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N^2. In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with 2N·N^2 = 2N^3 ALU ops needed.
The ALU:MEM ratio is important because if your computation is imbalanced relative to what your hardware supports, you’ll end up bottlenecked by one of the two and unable to exploit the surplus resources on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio, then whenever you’re only using the hardware to do matrix multiplications you’ll have enormous amounts of MEM ops capacity sitting idle, because you don’t have the ALU capacity to keep it busy.
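Here is a minimal sketch of that bottleneck logic as a toy roofline-style model, using the op counts from earlier in the thread; the 1:1 and 1000:1 ratios are just the illustrative figures discussed above.

```python
# Toy roofline-style model: runtime is set by whichever resource is the bottleneck.
def runtime_and_utilization(alu_ops, mem_ops, alu_rate, mem_rate):
    """Rates are ops per unit time; returns (time, ALU utilization, MEM utilization)."""
    time = max(alu_ops / alu_rate, mem_ops / mem_rate)   # the slower side dominates
    return time, (alu_ops / alu_rate) / time, (mem_ops / mem_rate) / time

n = 4096
workloads = {
    "MM (2N^3 ALU, 3N^2 MEM)": (2 * n**3, 3 * n**2),   # matrix-matrix: ALU-heavy
    "VM (2N^2 ALU, 3N^2 MEM)": (2 * n**2, 3 * n**2),   # vector-matrix: MEM-heavy
}
for device, ratio in [("1:1 device", 1), ("1000:1 GPU-like device", 1000)]:
    for name, (alu, mem) in workloads.items():
        _, alu_util, mem_util = runtime_and_utilization(alu, mem, alu_rate=ratio, mem_rate=1)
        print(f"{device:>22} | {name}: ALU util {alu_util:6.1%}, MEM util {mem_util:6.1%}")
```

On the 1:1 device the matrix-matrix workload leaves almost all of the memory bandwidth idle, while on the 1000:1 device the vector-matrix workload leaves the ALUs roughly 99.9% idle, which is the factor-of-~1000 inefficiency mentioned above.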
This is helpful, thanks a ton Ege!