Let me try writing out some estimates. My math is different than yours.

An H100 SXM has:
- 8e10 transistors
- 2e9 Hz boost frequency
- 2e15 FLOPS at FP16
- 7e2 W of max power consumption

Therefore:
- 2e6 eV are spent per FP16 operation
- This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 °C (and the ratio of bit erasures per FP16 operation is unclear to me; let’s pretend it’s O(1))
- An H100 performs 1e6 FP16 operations per clock cycle, which implies 8e4 transistors per FP16 operation (some of which may be inactive, of course)

This seems pretty inefficient to me!

To recap, modern chips are roughly ~8 orders of magnitude worse than the Landauer limit (with a bit-erasure-per-FP16-operation fudge factor that isn’t going to exceed 10). And this is in a configuration that takes 8e4 transistors to support a single FP16 operation!

Positing that brains are ~6 orders of magnitude more energy efficient than today’s transistor circuits doesn’t seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.

I’ll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn’t violate Landauer or any laws of physics, because we know it happens.
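For reference, here is the arithmetic above as a minimal, self-contained Python sketch. The spec values are the ones listed in the comment; the 70 °C figure enters through the Landauer formula kT ln 2.

```python
import math

# H100 SXM spec values quoted above
transistors = 8e10   # transistor count
clock_hz = 2e9       # boost frequency, Hz
fp16_flops = 2e15    # FP16 throughput, FLOP/s
power_w = 7e2        # max power consumption, W

J_PER_EV = 1.602e-19  # joules per electron volt
K_B = 8.617e-5        # Boltzmann constant, eV/K

# Energy dissipated per FP16 operation
ev_per_op = power_w / fp16_flops / J_PER_EV
print(f"eV per FP16 op:     {ev_per_op:.1e}")        # ~2e6 eV

# Landauer limit at 70 C (343 K): kT ln 2 per bit erasure
landauer_ev = K_B * 343 * math.log(2)
print(f"Landauer limit:     {landauer_ev:.1e} eV")   # ~2e-2 eV

# Ratio of the two, assuming O(1) bit erasures per operation
print(f"Ratio to Landauer:  {ev_per_op / landauer_ev:.0e}")  # ~1e8

# Transistors per FP16 operation
ops_per_cycle = fp16_flops / clock_hz                        # 1e6
print(f"Transistors per op: {transistors / ops_per_cycle:.0e}")  # 8e4
```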
> 2e6 eV are spent per FP16 operation… This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 °C (and the ratio of bit erasures per FP16 operation is unclear to me; let’s pretend it’s O(1))
2e-2 eV for the Landauer limit is right, but 2e6 eV per FP16 operation is off by one order of magnitude: (70 W)/(2e15 FLOP/s) ≈ 0.218 MeV. So the gap is 7 orders of magnitude, assuming one bit erasure per FLOP.

Edit: this is wrong. The max power consumption is 700 W, not 70 W, so the gap is indeed 8 orders of magnitude.
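The two readings of the power figure differ by exactly the disputed order of magnitude; a quick check under both assumptions:

```python
# Gap to the 2e-2 eV Landauer limit under the misread (70 W)
# and the actual (700 W) max power figures
for power_w in (70.0, 700.0):
    ev_per_op = power_w / 2e15 / 1.602e-19   # eV per FP16 operation
    print(f"{power_w:5.0f} W -> {ev_per_op:.1e} eV/op, "
          f"{ev_per_op / 2e-2:.0e}x Landauer")
# 70 W  -> 2.2e5 eV/op, 1e7x Landauer (7 OOM)
# 700 W -> 2.2e6 eV/op, 1e8x Landauer (8 OOM)
```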
> An H100 SXM has 8e10 transistors, 2e9 Hz boost frequency, 700 W of max power consumption...
8e10 * 2e9 = 1.6e20 transistor switches per second. This happens with a power consumption of 700 W, suggesting that each switch dissipates on the order of 30 eV of energy, which is only 3 OOM or so from the Landauer limit. So this device is actually not that inefficient if you look only at how efficiently it’s able to perform switches. My position is that you should not expect the brain to be much more efficient than this, though perhaps gaining one or two orders of magnitude is possible with complex error correction methods.
Of course, the number of transistors supporting each FLOP and the per-switch energy gap have to combine to give the 8 OOM overall efficiency gap we’ve calculated. However, it’s important that most of the inefficiency comes from the former and not the latter. I’ll elaborate on this later in the comment.
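Here is that decomposition spelled out, under the same (generous) assumption that every transistor switches once per clock cycle:

```python
import math

transistors, clock_hz, power_w = 8e10, 2e9, 700.0
landauer_ev = 2e-2                               # kT ln 2 at ~70 C, eV

# Per-switch energy, assuming every transistor switches each cycle
switches_per_s = transistors * clock_hz          # 1.6e20 switches/s
ev_per_switch = power_w / switches_per_s / 1.602e-19
print(f"eV per switch: {ev_per_switch:.0f}")     # ~27 eV

# The ~8 OOM total gap splits into a switching part and a
# transistors-per-FLOP part (8e4 transistors per op, from above)
oom_switching = math.log10(ev_per_switch / landauer_ev)  # ~3.1
oom_transistors = math.log10(8e4)                        # ~4.9
print(f"switching gap:      {oom_switching:.1f} OOM")
print(f"transistors per op: {oom_transistors:.1f} OOM")
print(f"total:              {oom_switching + oom_transistors:.1f} OOM")
```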
> This seems pretty inefficient to me!
I agree an H100 SXM is not a very efficient computational device. I never said modern GPUs represent the pinnacle of energy efficiency in computation or anything like that, though similar claims have previously been made by others on the forum.
> Positing that brains are ~6 orders of magnitude more energy efficient than today’s transistor circuits doesn’t seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.
Here we’re talking about the brain possibly doing 1e20 FLOP/s, which I’ve previously said is maybe within one order of magnitude of the Landauer limit or so, and not the more extravagant figure of 1e25 FLOP/s. The disagreement here is not about math; we both agree that this performance requires the brain to be 1 or 2 OOM from the bitwise Landauer limit depending on exactly how many bit erasures you think are involved in a single 16-bit FLOP.
The disagreement is more about how close you think the brain can come to this limit. Most of the energy losses in modern GPUs come from the enormous amounts of noise that you need to deal with in interconnects that are closely packed together. To get anywhere close to the bitwise Landauer limit, you need to get rid of all of these losses. This is what would be needed to lower the number of transistors supporting each FLOP without simultaneously increasing the power consumption of the device.
I just don’t see how the brain could possibly pull that off. The design constraints are pretty similar in both cases, and the brain is not using some unique kind of material or architecture which could eliminate dissipative or radiative energy losses in the system. Just as information needs to get carried around inside a GPU, information also needs to move inside the brain, and moving information around in a noisy environment is costly. So I would expect by default that the brain is many orders of magnitude from the Landauer limit, though I can see estimates as high as 1e17 FLOP/s being plausible if the brain is highly efficient. I just think you’ll always be losing many orders of magnitude relative to Landauer as long as your system is not ideal, and the brain is far from an ideal system.
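To make the "1 or 2 OOM" figure concrete: taking the commonly cited ~20 W power budget for the brain (my assumption, not a number stated in the thread), 1e20 FLOP/s implies:

```python
brain_power_w = 20.0   # commonly cited brain power budget (assumption)
flops = 1e20           # the performance figure under discussion
landauer_ev = 2e-2     # kT ln 2 near operating temperature, eV

ev_per_flop = brain_power_w / flops / 1.602e-19
print(f"eV per FLOP:       {ev_per_flop:.2f}")                  # ~1.25 eV
print(f"Ratio to Landauer: {ev_per_flop / landauer_ev:.0f}x")   # ~62x, 1-2 OOM
```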
> I’ll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn’t violate Landauer or any laws of physics, because we know it happens.
I don’t think you’ll lose as much relative to Landauer when you’re doing that, because you don’t have to move a lot of information around constantly. Transcribing a DNA sequence and other similar operations are local. The reason I think realistic devices will fall far short of Landauer is because of the problem of interconnect: computations cannot be localized effectively, so different parts of your hardware need to talk to each other, and that’s where you lose most of the energy. In terms of pure switching efficiency of transistors, we’re already pretty close to this kind of biological process, as I’ve calculated above.
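Rough numbers behind that biology-side figure, with heavily hedged inputs of my own (the thread itself only gives the <1 eV per bit conclusion): if incorporating one nucleotide costs about two phosphoanhydride bonds at roughly 0.3 eV each, and each base encodes 2 bits, then:

```python
ev_per_bond = 0.3   # rough free energy per phosphoanhydride bond, eV (assumption)
bonds_per_base = 2  # dNTP -> dNMP + PPi, plus PPi hydrolysis (assumption)
bits_per_base = 2   # four possible bases

ev_per_bit = ev_per_bond * bonds_per_base / bits_per_base
print(f"~{ev_per_bit:.1f} eV per bit, ~{ev_per_bit / 2e-2:.0f}x Landauer")
# ~0.3 eV per bit: about an order of magnitude from the limit,
# consistent with synthesis being a local operation
```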
> One potential advantage of the brain is that it is 3D, whereas chips are mostly 2D. I wonder what advantage that confers. Presumably getting information around is much easier with 50% more dimensions.
Probably true, and this could mean the brain has some substantial advantage over today’s hardware (like 1 OOM, say), but at the same time the internal mechanisms that biology uses to establish electrical potential energy gradients and so forth seem so inefficient. Quoting Eliezer:
> I’m confused at how somebody ends up calculating that a brain—where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually—could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed “Brain Efficiency” though not checked any numbers, and not seen anything inside it which seems to address this sanity check.
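Putting very rough numbers on that sanity check (the per-molecule energy costs below are my assumptions, not Eliezer's): with ~1e4 neurotransmitter molecules per spike, at least one ATP (~0.3 eV) per molecule for re-uptake, and ~1e4 pumped ions at roughly 3 ions per ATP:

```python
neurotransmitters = 1e4  # molecules per synaptic spike (from the quote)
ions_pumped = 1e4        # ions to repolarize the membrane (rough assumption)
atp_ev = 0.3             # free energy per ATP hydrolysis, eV (rough)
ions_per_atp = 3         # Na+/K+ pump moves ~3 Na+ per ATP

ev_per_spike = (neurotransmitters + ions_pumped / ions_per_atp) * atp_ev
print(f"~{ev_per_spike:.0e} eV per synaptic spike")        # ~4e3 eV
print(f"~{ev_per_spike / 2e-2:.0e}x Landauer per spike")   # ~2e5, i.e. >5 OOM
```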
Max power is 700 W, not 70 W. These chips are water-cooled beasts. Your estimate is off, not mine.
Huh, I wonder why I read 7e2 W as 70 W. Strange mistake.
No worries. I’ve made far worse. I only wish that H100s could operate at a gentle 70 W! :)
I’m posting this as a separate comment because it’s a different line of argument, but I think we should also keep it in mind when making estimates of how much computation the brain could actually be using.
If the brain is operating at a frequency of (say) 10 Hz and is doing 1e20 FLOP/s, that suggests the brain has something like 1e19 floating point parameters, or maybe specifying the “internal state” of the brain takes something like 1e20 bits. If you want to properly train a neural network of this size, you need to update on a comparable amount of useful entropy from the outside world. This means you have to believe that humans are receiving on the order of 1e11 bits or 10 GB of useful information about the world to update on every second if the brain is to be “fully trained” by the age of 30, say.
An estimate of 1e15 FLOP/s brings this down to a more realistic 100 KB per second or so, which still seems like a lot but is somewhat more believable if you consider the potential information content of visual and auditory stimuli. I think even this is an overestimate and that the brain has some algorithmic insights which make it somewhat more data efficient than contemporary neural networks, but I think the gap implied by 1e20 FLOP/s is rather too large for me to believe it.
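The arithmetic behind both scenarios, assuming ~10 bits to specify each parameter and 30 years ≈ 1e9 seconds of experience:

```python
seconds = 1e9   # ~30 years
freq_hz = 10    # assumed operating frequency of the brain

for flops in (1e20, 1e15):
    params = flops / freq_hz     # implied parameter count at this frequency
    bits_needed = params * 10    # ~10 bits to pin down each parameter
    bytes_per_s = bits_needed / seconds / 8
    print(f"{flops:.0e} FLOP/s -> {params:.0e} params, "
          f"{bytes_per_s:.0e} bytes/s of useful information needed")
# 1e+20 FLOP/s -> 1e+19 params, ~1e10 bytes/s (~10 GB/s)
# 1e+15 FLOP/s -> 1e+14 params, ~1e5 bytes/s (~100 KB/s)
```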