Just to pick on the step that gets the lowest probability in your calculation, estimating that the human brain does 1e20 FLOP/s with only 20 W of power consumption requires believing that the brain is basically operating at the bitwise Landauer limit, which is around 3e20 bit erasures per joule at room temperature. If the FLOP we’re talking about here is equivalent to operations on 8-bit floating point numbers, for example, the human brain would have an energy efficiency of around 1e20 bit erasures per joule, which is less than one order of magnitude from the Landauer limit at room temperature (300 K).
Needless to say, I find this estimate highly unrealistic. We have no idea how to build practical densely packed devices which get anywhere close to this limit; the best we can do at the moment is perhaps 5 orders of magnitude away. Are you really thinking that the human brain is 5 OOM more energy efficient than an A100?
Still, even this estimate is much more realistic than your claim that the human brain might take 8e34 FLOP to train, which ascribes a ludicrous ~ 1e26 FLOP/s computation capacity to the human brain if this training happens over 20 years. This obviously violates the Landauer limit on computation and so is going to be simply false, unless you think the human brain loses less than one bit of information per 1e5 floating point operations it’s doing. Good luck with that.
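For concreteness, here is the arithmetic behind those figures as a short sketch (the ~8 bit erasures per 8-bit FLOP is the same assumption as above):

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300             # room temperature, K

# Landauer limit: minimum energy per bit erasure, and erasures per joule
E_bit = k_B * T * math.log(2)        # ~2.9e-21 J
erasures_per_joule = 1 / E_bit       # ~3.5e20

# Brain at the claimed 1e20 FLOP/s and 20 W, assuming ~8 bit erasures per FLOP
brain_erasures_per_joule = 1e20 * 8 / 20             # 4e19
gap = erasures_per_joule / brain_erasures_per_joule  # ~9x, i.e. <1 OOM from the limit

# 8e34 FLOP spread over ~20 years of training
implied_flops = 8e34 / (20 * 3.15e7)                 # ~1.3e26 FLOP/s

print(f"Landauer at 300 K: {erasures_per_joule:.1e} erasures/J")
print(f"Brain claim:       {brain_erasures_per_joule:.1e} erasures/J ({gap:.1f}x below the limit)")
print(f"8e34 FLOP / 20 yr: {implied_flops:.1e} FLOP/s")
```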
I notice that Steven Byrnes has already made the argument that these estimates are poor, but I want to hammer home the point that they are not just poor, they are crazy. Obviously, a mistake this flagrant does not inspire confidence in the rest of the argument.
This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 C (and the ratio of bit erasures per FP16 operation is unclear to me; let’s pretend it’s O(1))
An H100 performs 1e6 FP16 operations per clock cycle, which implies 8e4 transistors per FP16 operation (some of which may be inactive, of course)
This seems pretty inefficient to me!
To recap, modern chips are roughly ~8 orders of magnitude worse than the Landauer limit (with a bit erasure per FP16 operation fudge factor that isn’t going to exceed 10). And this is in a configuration that takes 8e4 transistors to support a single FP16 operation!
Positing that brains are ~6 orders of magnitude more energy efficient than today’s transistor circuits doesn’t seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.
I’ll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn’t violate Landauer or any laws of physics, because we know it happens.
2e6 eV are spent per FP16 operation… This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 C (and the ratio of bit erasures per FP16 operation is unclear to me; let’s pretend it’s O(1))
2e-2 eV for the Landauer limit is right, but 2e6 eV per FP16 operation is off by one order of magnitude. (70 W)/(2e15 FLOP/s) = 0.218 MeV. So the gap is 7 orders of magnitude assuming one bit erasure per FLOP.
This is wrong, the power consumption is 700 W so the gap is indeed 8 orders of magnitude.
An H100 SXM has 8e10 transistors, 2e9 Hz boost frequency, 700 W of max power consumption...
8e10 * 2e9 = 1.6e20 transistor switches per second. This happens with a power consumption of 700 W, suggesting that each switch dissipates on the order of 30 eV of energy, which is only 3 OOM or so from the Landauer limit. So this device is actually not that inefficient if you look only at how efficiently it’s able to perform switches. My position is that you should not expect the brain to be much more efficient than this, though perhaps gaining one or two orders of magnitude is possible with complex error correction methods.
Of course, the transistors-per-FLOP factor and the per-switch efficiency gap have to add up to the 8 OOM overall efficiency gap we’ve calculated. However, it’s important that most of the inefficiency comes from the former and not the latter. I’ll elaborate on this later in the comment.
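To make the bookkeeping explicit in the meantime, here is a minimal sketch using the H100 numbers quoted above (8e10 transistors, 2 GHz, 2e15 FP16 FLOP/s, 700 W):

```python
import math

eV  = 1.602176634e-19  # J per eV
k_B = 1.380649e-23

# H100 SXM figures quoted above
transistors, freq, flops, power = 8e10, 2e9, 2e15, 700

E_per_flop   = power / flops / eV                   # ~2e6 eV per FP16 op
E_per_switch = power / (transistors * freq) / eV    # ~27 eV per transistor switch
switches_per_flop = transistors / (flops / freq)    # ~8e4 transistors per FP16 op

E_landauer = k_B * 343 * math.log(2) / eV           # ~0.02 eV per bit erasure at 70 C

print(f"{E_per_flop:.1e} eV/FLOP   -> {E_per_flop / E_landauer:.0e}x Landauer (~8 OOM)")
print(f"{E_per_switch:.0f} eV/switch  -> {E_per_switch / E_landauer:.0e}x Landauer (~3 OOM)")
print(f"{switches_per_flop:.0e} transistors per FP16 op (~5 OOM)")
print(f"check: {E_per_switch * switches_per_flop:.1e} eV/FLOP from the two factors")
```

Note that the ~5 OOM transistors-per-FLOP factor treats every transistor as switching each cycle, which overstates how many are actually active; it’s only meant to show how the two factors combine.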
This seems pretty inefficient to me!
I agree an H100 SXM is not a very efficient computational device. I never said modern GPUs represent the pinnacle of energy efficiency in computation or anything like that, though similar claims have previously been made by others on the forum.
Positing that brains are ~6 orders of magnitude more energy efficient than today’s transistor circuits doesn’t seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.
Here we’re talking about the brain possibly doing 1e20 FLOP/s, which I’ve previously said is maybe within one order of magnitude of the Landauer limit or so, and not the more extravagant figure of 1e25 FLOP/s. The disagreement here is not about math; we both agree that this performance requires the brain to be 1 or 2 OOM from the bitwise Landauer limit depending on exactly how many bit erasures you think are involved in a single 16-bit FLOP.
The disagreement is more about how close you think the brain can come to this limit. Most of the energy losses in modern GPUs come from the enormous amounts of noise that you need to deal with in interconnects that are closely packed together. To get anywhere close to the bitwise Landauer limit, you need to get rid of all of these losses. This is what would be needed to lower the number of transistors supporting each FLOP without simultaneously increasing the power consumption of the device.
I just don’t see how the brain could possibly pull that off. The design constraints are pretty similar in both cases, and the brain is not using some unique kind of material or architecture which could eliminate dissipative or radiative energy losses in the system. Just as information needs to get carried around inside a GPU, information also needs to move inside the brain, and moving information around in a noisy environment is costly. So I would expect by default that the brain is many orders of magnitude from the Landauer limit, though I can see estimates as high as 1e17 FLOP/s being plausible if the brain is highly efficient. I just think you’ll always be losing many orders of magnitude relative to Landauer as long as your system is not ideal, and the brain is far from an ideal system.
I’ll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn’t violate Landauer or any laws of physics, because we know it happens.
I don’t think you’ll lose as much relative to Landauer when you’re doing that, because you don’t have to move a lot of information around constantly. Transcribing a DNA sequence and other similar operations are local. The reason I think realistic devices will fall far short of Landauer is because of the problem of interconnect: computations cannot be localized effectively, so different parts of your hardware need to talk to each other, and that’s where you lose most of the energy. In terms of pure switching efficiency of transistors, we’re already pretty close to this kind of biological process, as I’ve calculated above.
One potential advantage of the brain is that it is 3D, whereas chips are mostly 2D. I wonder what advantage that confers. Presumably getting information around is much easier with 50% more dimensions.
Probably true, and this could mean the brain has some substantial advantage over today’s hardware (like 1 OOM, say), but at the same time the internal mechanisms that biology uses to establish electrical potential energy gradients and so forth seem quite inefficient. Quoting Eliezer:
I’m confused at how somebody ends up calculating that a brain—where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually—could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed “Brain Efficiency” though not checked any numbers, and not seen anything inside it which seems to address this sanity check.
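To put very rough numbers on that sanity check (the ~0.5 eV per ATP and the one-ATP-per-molecule accounting are loose assumptions of mine, not figures from the quote):

```python
eV_per_ATP          = 0.5    # rough free energy of ATP hydrolysis under cellular conditions
molecules_per_event = 1e4    # the ~10,000 neurotransmitter molecules quoted above
landauer_eV         = 0.02   # per bit erasure near body temperature

cost = molecules_per_event * eV_per_ATP
print(f"~{cost:.0e} eV per synaptic event, ~{cost / landauer_eV:.0e}x the Landauer bit energy")
```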
I’m posting this as a separate comment because it’s a different line of argument, but I think we should also keep it in mind when making estimates of how much computation the brain could actually be using.
If the brain is operating at a frequency of (say) 10 Hz and is doing 1e20 FLOP/s, that suggests the brain has something like 1e19 floating point parameters, or maybe that specifying the “internal state” of the brain takes something like 1e20 bits. If you want to properly train a neural network of this size, you need to update on a comparable amount of useful entropy from the outside world. This means you have to believe that humans are receiving on the order of 1e11 bits, or roughly 10 GB, of useful information about the world to update on every second if the brain is to be “fully trained” by the age of 30, say.
An estimate of 1e15 FLOP/s brings this down to a more realistic 100 KB per second or so, which still seems like a lot but is somewhat more believable if you consider the potential information content of visual and auditory stimuli. I think even this is an overestimate and that the brain has some algorithmic insights which make it somewhat more data efficient than contemporary neural networks, but I think the gap implied by 1e20 FLOP/s is rather too large for me to believe it.
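Here is that back-of-the-envelope as a sketch; the 10 Hz clock, ~10 bits per parameter, and ~1e9 seconds of lifetime learning are the rough assumptions from the paragraphs above:

```python
def required_input_rate(flop_per_s, clock_hz=10, bits_per_param=10, lifetime_s=1e9):
    """Bits/s of useful outside-world information needed to pin down a model
    of the implied size, under the crude assumptions stated above."""
    params = flop_per_s / clock_hz          # ~1 FLOP per parameter per tick
    total_bits = params * bits_per_param    # "internal state" of the brain
    return total_bits / lifetime_s

for f in (1e20, 1e15):
    bps = required_input_rate(f)
    print(f"{f:.0e} FLOP/s -> {bps:.0e} bits/s (~{bps / 8:.1e} bytes/s)")
```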
Thanks for the constructive comments. I’m open-minded to being wrong here. I’ve already updated a bit and I’m happy to update more.
Regarding the Landauer limit, I’m confused by a few things:
First, I’m confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we’ve done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOPS but with a much smaller rate of bit erasures.
Second, I have no idea how to map the fidelity of brain operations to floating point precision, so I really don’t know if we should be comparing 1 bit, 8 bit, 64 bit, or not at all. Any ideas?
Regarding training requiring 8e34 floating point operations:
First, I’m confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we’ve done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOPS but with a much smaller rate of bit erasures.
As a minor nitpick, if A and B are 8-bit floating point numbers then the multiplication map x → B*x is almost never injective. This means even in your idealized setup, the operation (A, B) → (A, B*A) is going to lose some information, though I agree that this information loss will be << 8 bits, probably more like 1 bit amortized or so.
The bigger problem is that logical reversibility doesn’t imply physical reversibility. I can think of ways in which we could set up sophisticated classical computation devices which are logically reversible, and perhaps could be made approximately physically reversible when operating in a near-adiabatic regime at low frequencies, but the brain is not operating in this regime (especially if it’s performing 1e20 FLOP/s). At high frequencies, I just don’t see which architecture you have in mind to perform lots of 8-bit floating point multiplications without raising the entropy of the environment by on the order of 8 bits.
Again using your setup, if you actually tried to implement (A, B) → (A, A*B) on a physical device, you would need to take the register that is storing B and replace the stored value with A*B instead. To store 1 bit of information you need a potential energy barrier that’s at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you’ll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn’t have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
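For scale, ~8 such barrier heights at body temperature come to only a fraction of an eV (a trivial sketch; the 310 K figure is my assumption):

```python
import math

k_B = 1.380649e-23
eV  = 1.602176634e-19
T   = 310  # K, roughly body temperature

barrier = k_B * T * math.log(2) / eV   # minimum barrier to hold one bit, ~0.019 eV
print(f"~{barrier:.3f} eV per bit, ~{8 * barrier:.2f} eV to rewrite an 8-bit register")
```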
Ajeya Cotra estimates training could take anything from 1e24 to 1e54 floating point operations, or even more. Her narrower lifetime anchor ranges from 1e24 to 1e38ish.
Two points about that:
This is a measure that takes into account the uncertainty over how much less efficient our software is compared to the human brain. I agree that human lifetime learning compute being around 1e25 FLOP is not strong evidence that the first TAI system we train will use 1e25 FLOP of compute; I expect it to take significantly more than that.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we’re already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
Do you think Cotra’s estimates are not just poor, but crazy as well?
No, because Cotra doesn’t claim that the human brain performs 1e25 FLOP/s—her claim is quite different.
The claim that “the first AI system to match the performance of the human brain might require 1e25 FLOP/s to run” is not necessarily crazy, though it needs to be supported by evidence of the relative inefficiency of our algorithms compared to the human brain and by estimates of how much software progress we should expect to be made in the future.
Thanks, that’s clarifying. (And yes, I’m well aware that x → B*x is almost never injective, which is why I said it wouldn’t cause 8 bits of erasure rather than the stronger, incorrect claim of 0 bits of erasure.)
To store 1 bit of information you need a potential energy barrier that’s at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you’ll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn’t have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
Two more points of confusion:
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
To what extent do information storage requirements weigh on FLOPS requirements? It’s not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
Where else is the energy going to go? Again, in an adiabatic device where you have a lot of time to discharge capacitors and such, you might be able to do everything in a way that conserves free energy. I just don’t see how that’s going to work when you’re (for example) switching transistors on and off at a high frequency. It seems to me that the only place to get rid of the electrical potential energy that quickly is to convert it into heat or radiation.
I think what I’m saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post. If you have a proposal for how you think the brain could actually be working to be much more energy efficient than this, I would like to see some details of it, because I’ve certainly not come across anything like that before.
To what extent do information storage requirements weigh on FLOPS requirements? It’s not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short
The Boltzmann factor roughly gives you the steady-state distribution of the associated two-state Markov chain, so if time delays are short it’s possible this would be irrelevant. However, I think that in realistic devices the Markov chain reaches equilibrium far too quickly for you to get around the thermodynamic argument by appealing to the system being out of equilibrium.
My reasoning here is that the Boltzmann factor also gives you the odds of an electron having enough kinetic energy to cross the potential barrier upon colliding with it, so e.g. if you imagine an electron stuck in a potential well that’s O(k_B T) deep, the electron will only need to collide with one of the barriers O(1) times to escape. So the convergence time comes down to roughly the length of the well divided by the thermal speed of the electron, which is going to be quite short, as electrons at the Fermi level in a typical wire move at speeds comparable to 1000 km/s.
I can try to calculate exactly what you should expect the convergence time here to be for some configuration you have in mind, but I’m reasonably confident when the energies involved are comparable to the Landauer bit energy this convergence happens quite rapidly for any kind of realistic device.
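Here is the kind of crude estimate I have in mind (the 100 nm well length is an arbitrary illustrative choice; the ~1e6 m/s figure is the Fermi-level speed mentioned above):

```python
import math

k_B = 1.380649e-23
T   = 300

def equilibration_time(barrier_J, well_length_m, speed_m_s):
    """Two-state picture: attempt rate ~ v / L, escape probability per
    attempt ~ exp(-E_b / k_B T)."""
    attempt_rate = speed_m_s / well_length_m
    p_escape = math.exp(-barrier_J / (k_B * T))
    return 1 / (attempt_rate * p_escape)

E_b = k_B * T * math.log(2)                  # a Landauer-scale barrier
t = equilibration_time(E_b, 100e-9, 1e6)
print(f"~{t:.0e} s, versus a ~5e-10 s clock period at 2 GHz")
```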
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
Where else is the energy going to go?
What is “the energy” that has to go somewhere? As you recognize, there’s nothing that says it costs energy to change the shape of a potential well. I’m genuinely not sure what energy you’re talking about here. Is it electrical potential energy spent polarizing a medium?
I think what I’m saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post.
Yeah, that’s pretty standard. The ultimate efficiency limit for a semiconductor field-effect transistor is bounded by the 60 mV/dec subthreshold swing, and modern tiny transistors have to deal with all sorts of problems like leakage current which make it difficult to even reach that limit.
Unclear to me that semiconductor field-effect transistors have anything to do with neurons, but I don’t know how neurons work, so my confusion is more likely a state of my mind than a state of the world.
I don’t think transistors have too much to do with neurons beyond the abstract observation that neurons most likely store information by establishing gradients of potential energy. When the stored information needs to be updated, that means some gradients have to get moved around, and if I had to imagine how this works inside a cell it would probably involve some kind of proton pump operating across a membrane or something like that. That’s going to be functionally pretty similar to a capacitor, and discharging & recharging it probably carries similar free energy costs.
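To gesture at the magnitudes in that analogy, here is the capacitor arithmetic with rough illustrative numbers (the ~1 uF/cm^2 specific capacitance and ~0.1 V potential are standard ballpark figures for cell membranes; the 1 um^2 patch size is an arbitrary assumption on my part):

```python
eV = 1.602176634e-19

C_specific = 1e-6      # F/cm^2, typical specific membrane capacitance
area       = 1e-8      # cm^2, i.e. a 1 um^2 patch (illustrative)
V          = 0.1       # volts, roughly the scale of membrane potentials

C = C_specific * area
E = 0.5 * C * V**2     # energy to charge or discharge the patch once
print(f"~{E / eV:.0f} eV per charge/discharge, vs ~0.02 eV per Landauer bit")
```

Even at this tiny scale, the cost per charge/discharge sits orders of magnitude above the Landauer bit energy.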
I think what I don’t understand is why you’re defaulting to the assumption that the brain has a way to store and update information that’s much more efficient than what we’re able to do. That doesn’t sound like a state of ignorance to me; it seems like you wouldn’t hold this belief if you didn’t think there was a good reason to do so.
I think what I don’t understand is why you’re defaulting to the assumption that the brain has a way to store and update information that’s much more efficient than what we’re able to do. That doesn’t sound like a state of ignorance to me; it seems like you wouldn’t hold this belief if you didn’t think there was a good reason to do so.
It’s my assumption because our brains are AGI for ~20 W.
In contrast, many kW of GPUs are not AGI.
Therefore, it seems like brains have a way of storing and updating information that’s much more efficient than what we’re able to do.
Of course, maybe I’m wrong and it’s due to a lack of training or lack of data or lack of algorithms, rather than lack of hardware.
DNA storage is way more information dense than hard drives, for example.
It’s my assumption because our brains are AGI for ~20 W.
I think that’s probably the crux. I think the evidence that the brain is not performing that much computation is reasonably good, so I attribute the difference to algorithmic advantages the brain has, particularly ones that make the brain more data efficient relative to today’s neural networks.
The brain being more data efficient I think is hard to dispute, but of course you can argue that this is simply because the brain is doing a lot more computation internally to process the limited amount of data it does see. I’m more ready to believe that the brain has some software advantage over neural networks than to believe that it has an enormous hardware advantage.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we’re already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
Isn’t it a noted weakness of Cotra’s approach that most of the anchors don’t actually depend on 2020 architecture or algorithmic performance in any concrete way? As in, if the same method were applied today, it would produce the same numbers in “2023 FLOP”? This is related to why I think the Beniaguev paper is pretty relevant exactly as evidence of “inefficiency of our algorithms compared to the human brain”.
If I understand correctly, the claim isn’t necessarily that the brain is “doing” that many FLOP/s, but that using floating point operations on GPUs to do the amount of computation that the brain does (to achieve the same results) is very inefficient. The authors cite Single cortical neurons as deep artificial neural networks (Beniaguev et al. 2021), writing, “A recent attempt by Beniaguev et al to estimate the computational complexity of a biological neuron used neural networks to predict in-vitro data on the signal activity of a pyramidal neuron (the most common kind in the human brain) and found that it took a neural network with about 1000 computational “neurons” and hundreds of thousands of parameters, trained on a modern GPU for several days, to replicate its function.” If you want to use a neural network to do the same thing as a cortical neuron, then one way to do it is, following Beniaguev et al., to run a 7-layer, width-128 temporal convolutional network with 150 ms memory every millisecond. A central estimate of 1e32 FLOP to get the equivalent of 30 years of learning (1e9 seconds) with 1e15 synapses does seem reasonable from there. (With 4 inputs/filters, 1e15 × 1e9 × 1e3 × (7×128×150×4) ≈ 5e32, if I haven’t confused myself.)
That does imply the estimate is an upper bound on computational costs to emulate a neuron with an artificial neural network, although the authors argue that it’s likely fairly tight. It also implies the brain is doing its job much more efficiently than we know how to use an A100 to do it, but I’m not sure why that should be particularly surprising. It’s also true that for some tasks we already know how to do much better than emulating a brain.
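As a quick check of that arithmetic (my paraphrase of the setup above: a 7-layer, width-128 TCN with 150 ms of memory and 4 input filters, evaluated once per millisecond, for 1e15 synapses over 1e9 seconds):

```python
synapses      = 1e15
seconds       = 1e9                  # ~30 years
steps_per_sec = 1e3                  # one network evaluation per millisecond
ops_per_step  = 7 * 128 * 150 * 4    # layers * width * memory * inputs, ~5.4e5

total = synapses * seconds * steps_per_sec * ops_per_step
print(f"~{total:.1e} FLOP")          # ~5.4e32, matching the ~5e32 above
```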
Recapitulating the response of Steven Byrnes to this argument: it may be very expensive computationally to simulate a computer in a faithful way, but that doesn’t mean it’s expensive to do the same computation that the computer in question is doing. Paraphrasing a nice quote from Richard Borcherds, it may be that teapots are very hard to simulate on a classical computer, but that doesn’t mean that they are useful computational devices.
If we tried to simulate a GPU doing a simple matrix multiplication at high physical fidelity, we would have to take so many factors into account that the cost of our simulation would far exceed the cost of running the GPU itself. Similarly, if we tried to program a physically realistic simulation of the human brain, I have no doubt that the computational cost of doing so would be enormous.
However, this is not what we’re interested in doing. We’re interested in creating a computer that’s doing the same kind of computation as the brain, and the amount of useful computation that the brain could be doing per second is much less than 1e25 or even 1e20 FLOP/s. If your point is that 1e25 FLOP/s is an upper bound on how much computation the brain is doing, I agree, but there’s no reason to think it’s a tight upper bound.
It also implies the brain is doing its job much more efficiently than we know how to use an A100 to do it, but I’m not sure why that should be particularly surprising.
This claim is different from the claim that the brain is doing 1e20 FLOP/s of useful computation, which is the claim that the authors actually make. If you have an object that implements some efficient algorithm that you don’t understand, the object can be doing little useful computation even though you would need much greater amounts of computation to match its performance with a worse algorithm. The estimates coming from the brain are important because they give us a sense of how much software efficiency progress ought to be possible here.
My argument from the Landauer limit is about the number of bit erasures and doesn’t depend on the software being implemented by the brain vs. a GPU. If the brain is doing something that’s in some sense equivalent to 1e20 floating point operations per second, based on its power consumption that would imply that it’s operating basically at the Landauer limit, perhaps only one order of magnitude off. Just the huge amount of noise in brain interconnect should be enough to discredit this estimate. Whether the brain is specialized to perform one or another kind of task is not relevant for this calculation.
Perhaps you think the brain has massive architectural or algorithmic advantages over contemporary neural networks, but if you do, that is a position that has to be defended on very different grounds than “it would take X amount of FLOP/s to simulate one neuron at a high physical fidelity”.
If we tried to simulate a GPU doing a simple matrix multiplication at high physical fidelity, we would have to take so many factors into account that the cost of our simulation would far exceed the cost of running the GPU itself. Similarly, if we tried to program a physically realistic simulation of the human brain, I have no doubt that the computational cost of doing so would be enormous.
The Beniaguev paper does not attempt to simulate neurons at high physical fidelity. It merely attempts to simulate their outputs, which is a far simpler task. I am in total agreement with you that the computation needed to simulate a system is entirely distinct from the computation being performed by that system. Simulating a human brain would require vastly more than 1e21 FLOPS.
This claim is different from the claim that the brain is doing 1e20 FLOP/s of useful computation, which is the claim that the authors actually make.
Is it? I suppose they don’t say so explicitly, but it sounds like they’re using “2020-equivalent” FLOPs (or whatever it is Cotra and Carlsmith use), which has room for “algorithmic progress” baked in.
Perhaps you think the brain has massive architectural or algorithmic advantages over contemporary neural networks, but if you do, that is a position that has to be defended on very different grounds than “it would take X amount of FLOP/s to simulate one neuron at a high physical fidelity”.
I may be reading the essay wrong, but I think this is the claim being made and defended. “Simulating” a neuron at any level of physical detail is going to be irrelevantly difficult, and indeed in Beniaguev et al., running a DNN on a GPU that implements the computation a neuron is doing (four binary inputs, one output) is a 2000X speedup over solving PDEs (a combination of compression and hardware/software). They find it difficult to make the neural network smaller or shorter-memory, suggesting it’s hard to implement the same computation more efficiently with current methods.
I think you’re just reading the essay wrong. In the “executive summary” section, they explicitly state that
Our best anchor for how much compute an AGI needs is the human brain, which we estimate to perform 1e20–1e21 FLOPS.
and
In addition, we estimate that today’s computer hardware is ~5 orders of magnitude less cost efficient and energy efficient than brains.
I don’t know how you read those claims and arrived at your interpretation, and indeed I don’t know how the evidence they provide could support the interpretation you’re talking about. It would also be a strange omission to not mention the “effective” part of “effective FLOP” explicitly if that’s actually what you’re talking about.
Thanks, I see. I agree that a lot of confusion could be avoided with clearer language, but I think at least that they’re not making as simple an error as you describe in the root comment. Ted does say in the EA Forum thread that they don’t believe brains operate at the Landauer limit, but I’ll let him chime in here if he likes.
I think the “effective FLOP” concept is very muddy, but I’m even less sure what it would mean to alternatively describe what the brain is doing in “absolute” FLOPs. Meanwhile, the model they’re using gives a relatively well-defined equivalence between the logical function of the neuron and modern methods on a modern GPU.
The statement about cost and energy efficiency as they elaborate in the essay body is about getting human-equivalent task performance relative to paying a human worker $25/hour, not saying that the brain uses five orders of magnitude less energy per FLOP of any kind. Closing that gap of five orders of magnitude could come either from doing less computation than the logical-equivalent-neural-network or from decreasing the cost of computation.
Just to pick on the step that gets the lowest probability in your calculation, estimating that the human brain does 1e20 FLOP/s with only 20 W of power consumption requires believing that the brain is basically operating at the bitwise Landauer limit, which is around 3e20 bit erasures per joule at room temperature. If the FLOP we’re talking about here is equivalent to operations on 8-bit floating point numbers, for example, the human brain would have an energy efficiency of around 1e20 bit erasures per joule, which is less than one order of magnitude from the Landauer limit at room temperature (300 K).
Needless to say, I find this estimate highly unrealistic. We have no idea how to build practical densely packed devices which get anywhere close to this limit; the best we can do at the moment is perhaps 5 orders of magnitude away. Are you really thinking that the human brain is 5 OOM more energy efficient than an A100?
Still, even this estimate is much more realistic than your claim that the human brain might take 8e34 FLOP to train, which ascribes a ludicrous ~ 1e26 FLOP/s computation capacity to the human brain if this training happens over 20 years. This obviously violates the Landauer limit on computation and so is going to be simply false, unless you think the human brain loses less than one bit of information per 1e5 floating point operations it’s doing. Good luck with that.
I notice that Steven Byrnes has already made the argument that these estimates are poor, but I want to hammer home the point that they are not just poor, they are crazy. Obviously, a mistake this flagrant does not inspire confidence in the rest of the argument.
Let me try writing out some estimates. My math is different than yours.
An H100 SXM has:
8e10 transistors
2e9 Hz boost frequency
2e15 FLOPS at FP16
7e2 W of max power consumption
Therefore:
2e6 eV are spent per FP16 operation
This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 C (and the ratio of bit erasures per FP16 operation is unclear to me; let’s pretend it’s O(1))
An H100 performs 1e6 FP16 operations per clock cycle, which implies 8e4 transistors per FP16 operation (some of which may be inactive, of course)
This seems pretty inefficient to me!
To recap, modern chips are roughly ~8 orders of magnitude worse than the Landauer limit (with a bit erasure per FP16 operation fudge factor that isn’t going to exceed 10). And this is in a configuration that takes 8e4 transistors to support a single FP16 operation!
Positing that brains are ~6 orders of magnitude more energy efficient than today’s transistor circuits doesn’t seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.
I’ll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn’t violate Landauer or any laws of physics, because we know it happens.
2e-2 eV for the Landauer limit is right, but 2e6 eV per FP16 operation is off by one order of magnitude. (70 W)/(2e15 FLOP/s) = 0.218 MeV. So the gap is 7 orders of magnitude assuming one bit erasure per FLOP.
This is wrong, the power consumption is 700 W so the gap is indeed 8 orders of magnitude.
8e10 * 2e9 = 1.6e20 transistor switches per second. This happens with a power consumption of 700 W, suggesting that each switch dissipates on the order of 30 eV of energy, which is only 3 OOM or so from the Landauer limit. So this device is actually not that inefficient if you look only at how efficiently it’s able to perform switches. My position is that you should not expect the brain to be much more efficient than this, though perhaps gaining one or two orders of magnitude is possible with complex error correction methods.
Of course, the transistors-per-FLOP factor and the per-switch efficiency gap have to add up to the 8 OOM overall efficiency gap we’ve calculated. However, it’s important that most of the inefficiency comes from the former and not the latter. I’ll elaborate on this later in the comment.
I agree an H100 SXM is not a very efficient computational device. I never said modern GPUs represent the pinnacle of energy efficiency in computation or anything like that, though similar claims have previously been made by others on the forum.
Here we’re talking about the brain possibly doing 1e20 FLOP/s, which I’ve previously said is maybe within one order of magnitude of the Landauer limit or so, and not the more extravagant figure of 1e25 FLOP/s. The disagreement here is not about math; we both agree that this performance requires the brain to be 1 or 2 OOM from the bitwise Landauer limit depending on exactly how many bit erasures you think are involved in a single 16-bit FLOP.
The disagreement is more about how close you think the brain can come to this limit. Most of the energy losses in modern GPUs come from the enormous amounts of noise that you need to deal with in interconnects that are closely packed together. To get anywhere close to the bitwise Landauer limit, you need to get rid of all of these losses. This is what would be needed to lower the number of transistors supporting each FLOP without simultaneously increasing the power consumption of the device.
I just don’t see how the brain could possibly pull that off. The design constraints are pretty similar in both cases, and the brain is not using some unique kind of material or architecture which could eliminate dissipative or radiative energy losses in the system. Just as information needs to get carried around inside a GPU, information also needs to move inside the brain, and moving information around in a noisy environment is costly. So I would expect by default that the brain is many orders of magnitude from the Landauer limit, though I can see estimates as high as 1e17 FLOP/s being plausible if the brain is highly efficient. I just think you’ll always be losing many orders of magnitude relative to Landauer as long as your system is not ideal, and the brain is far from an ideal system.
I don’t think you’ll lose as much relative to Landauer when you’re doing that, because you don’t have to move a lot of information around constantly. Transcribing a DNA sequence and other similar operations are local. The reason I think realistic devices will fall far short of Landauer is because of the problem of interconnect: computations cannot be localized effectively, so different parts of your hardware need to talk to each other, and that’s where you lose most of the energy. In terms of pure switching efficiency of transistors, we’re already pretty close to this kind of biological process, as I’ve calculated above.
One potential advantage of the brain is that it is 3D, whereas chips are mostly 2D. I wonder what advantage that confers. Presumably getting information around is much easier with 50% more dimensions.
Probably true, and this could mean the brain has some substantial advantage over today’s hardware (like 1 OOM, say), but at the same time the internal mechanisms that biology uses to establish electrical potential energy gradients and so forth seem quite inefficient. Quoting Eliezer:
Max power is 700 W, not 70 W. These chips are water-cooled beasts. Your estimate is off, not mine.
Huh, I wonder why I read 7e2 W as 70 W. Strange mistake.
No worries. I’ve made far worse. I only wish that H100s could operate at a gentle 70 W! :)
I’m posting this as a separate comment because it’s a different line of argument, but I think we should also keep it in mind when making estimates of how much computation the brain could actually be using.
If the brain is operating at a frequency of (say) 10 Hz and is doing 1e20 FLOP/s, that suggests the brain has something like 1e19 floating point parameters, or maybe that specifying the “internal state” of the brain takes something like 1e20 bits. If you want to properly train a neural network of this size, you need to update on a comparable amount of useful entropy from the outside world. This means you have to believe that humans are receiving on the order of 1e11 bits, or roughly 10 GB, of useful information about the world to update on every second if the brain is to be “fully trained” by the age of 30, say.
An estimate of 1e15 FLOP/s brings this down to a more realistic 100 KB per second or so, which still seems like a lot but is somewhat more believable if you consider the potential information content of visual and auditory stimuli. I think even this is an overestimate and that the brain has some algorithmic insights which make it somewhat more data efficient than contemporary neural networks, but I think the gap implied by 1e20 FLOP/s is rather too large for me to believe it.
Thanks for the constructive comments. I’m open-minded to being wrong here. I’ve already updated a bit and I’m happy to update more.
Regarding the Landauer limit, I’m confused by a few things:
First, I’m confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we’ve done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOPS but with a much smaller rate of bit erasures.
Second, I have no idea how to map the fidelity of brain operations to floating point precision, so I really don’t know if we should be comparing 1 bit, 8 bit, 64 bit, or not at all. Any ideas?
Regarding training requiring 8e34 floating point operations:
Ajeya Cotra estimates training could take anything from 1e24 to 1e54 floating point operations, or even more. Her narrower lifetime anchor ranges from 1e24 to 1e38ish. https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit
Do you think Cotra’s estimates are not just poor, but crazy as well? If they were crazy, I would have expected to see her two-year update mention the mistake, or the top comments to point it out, but I see neither: https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines
As a minor nitpick, if A and B are 8-bit floating point numbers then the multiplication map x → B*x is almost never injective. This means even in your idealized setup, the operation (A, B) → (A, B*A) is going to lose some information, though I agree that this information loss will be << 8 bits, probably more like 1 bit amortized or so.
The bigger problem is that logical reversibility doesn’t imply physical reversibility. I can think of ways in which we could set up sophisticated classical computation devices which are logically reversible, and perhaps could be made approximately physically reversible when operating in a near-adiabatic regime at low frequencies, but the brain is not operating in this regime (especially if it’s performing 1e20 FLOP/s). At high frequencies, I just don’t see which architecture you have in mind to perform lots of 8-bit floating point multiplications without raising the entropy of the environment by on the order of 8 bits.
Again using your setup, if you actually tried to implement (A, B) → (A, A*B) on a physical device, you would need to take the register that is storing B and replace the stored value with A*B instead. To store 1 bit of information you need a potential energy barrier that’s at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you’ll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn’t have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
Two points about that:
This is a measure that takes into account the uncertainty over how much less efficient our software is compared to the human brain. I agree that human lifetime learning compute being around 1e25 FLOP is not strong evidence that the first TAI system we train will use 1e25 FLOP of compute; I expect it to take significantly more than that.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we’re already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
No, because Cotra doesn’t claim that the human brain performs 1e25 FLOP/s—her claim is quite different.
The claim that “the first AI system to match the performance of the human brain might require 1e25 FLOP/s to run” is not necessarily crazy, though it needs to be supported by evidence of the relative inefficiency of our algorithms compared to the human brain and by estimates of how much software progress we should expect to be made in the future.
Thanks, that’s clarifying. (And yes, I’m well aware that x → B*x is almost never injective, which is why I said it wouldn’t cause 8 bits of erasure rather than the stronger, incorrect claim of 0 bits of erasure.)
Two more points of confusion:
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
To what extent do information storage requirements weigh on FLOPS requirements? It’s not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short
Where else is the energy going to go? Again, in an adiabatic device where you have a lot of time to discharge capacitors and such, you might be able to do everything in a way that conserves free energy. I just don’t see how that’s going to work when you’re (for example) switching transistors on and off at a high frequency. It seems to me that the only place to get rid of the electrical potential energy that quickly is to convert it into heat or radiation.
I think what I’m saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post. If you have a proposal for how you think the brain could actually be working to be much more energy efficient than this, I would like to see some details of it, because I’ve certainly not come across anything like that before.
The Boltzmann factor roughly gives you the steady-state distribution of the associated two-state Markov chain, so if time delays are short it’s possible this would be irrelevant. However, I think that in realistic devices the Markov chain reaches equilibrium far too quickly for you to get around the thermodynamic argument by appealing to the system being out of equilibrium.
My reasoning here is that the Boltzmann factor also gives you the odds of an electron having enough kinetic energy to cross the potential barrier upon colliding with it, so e.g. if you imagine an electron stuck in a potential well that’s O(k_B T) deep, the electron will only need to collide with one of the barriers O(1) times to escape. So the convergence time comes down to roughly the length of the well divided by the thermal speed of the electron, which is going to be quite short, as electrons at the Fermi level in a typical wire move at speeds comparable to 1000 km/s.
I can try to calculate exactly what you should expect the convergence time here to be for some configuration you have in mind, but I’m reasonably confident when the energies involved are comparable to the Landauer bit energy this convergence happens quite rapidly for any kind of realistic device.
What is “the energy” that has to go somewhere? As you recognize, there’s nothing that says it costs energy to change the shape of a potential well. I’m genuinely not sure what energy you’re talking about here. Is it electrical potential energy spent polarizing a medium?
Yeah, that’s pretty standard. The ultimate efficiency limit for a semiconductor field-effect transistor is bounded by the 60 mV/dec subthreshold swing, and modern tiny transistors have to deal with all sorts of problems like leakage current which make it difficult to even reach that limit.
Unclear to me that semiconductor field-effect transistors have anything to do with neurons, but I don’t know how neurons work, so my confusion is more likely a state of my mind than a state of the world.
I don’t think transistors have too much to do with neurons beyond the abstract observation that neurons most likely store information by establishing gradients of potential energy. When the stored information needs to be updated, that means some gradients have to get moved around, and if I had to imagine how this works inside a cell it would probably involve some kind of proton pump operating across a membrane or something like that. That’s going to be functionally pretty similar to a capacitor, and discharging & recharging it probably carries similar free energy costs.
I think what I don’t understand is why you’re defaulting to the assumption that the brain has a way to store and update information that’s much more efficient than what we’re able to do. That doesn’t sound like a state of ignorance to me; it seems like you wouldn’t hold this belief if you didn’t think there was a good reason to do so.
It’s my assumption because our brains are AGI for ~20 W.
In contrast, many kW of GPUs are not AGI.
Therefore, it seems like brains have a way of storing and updating information that’s much more efficient than what we’re able to do.
Of course, maybe I’m wrong and it’s due to a lack of training or lack of data or lack of algorithms, rather than lack of hardware.
DNA storage is way more information dense than hard drives, for example.
I think that’s probably the crux. I think the evidence that the brain is not performing that much computation is reasonably good, so I attribute the difference to algorithmic advantages the brain has, particularly ones that make the brain more data efficient relative to today’s neural networks.
The brain being more data efficient I think is hard to dispute, but of course you can argue that this is simply because the brain is doing a lot more computation internally to process the limited amount of data it does see. I’m more ready to believe that the brain has some software advantage over neural networks than to believe that it has an enormous hardware advantage.
Isn’t it a noted weakness of Cotra’s approach that most of the anchors don’t actually depend on 2020 architecture or algorithmic performance in any concrete way? As in, if the same method were applied today, it would produce the same numbers in “2023 FLOP”? This is related to why I think the Beniaguev paper is pretty relevant exactly as evidence of “inefficiency of our algorithms compared to the human brain”.
If I understand correctly, the claim isn’t necessarily that the brain is “doing” that many FLOP/s, but that using floating point operations on GPUs to do the amount of computation that the brain does (to achieve the same results) is very inefficient. The authors cite Single cortical neurons as deep artificial neural networks (Beniaguev et al. 2021), writing, “A recent attempt by Beniaguev et al to estimate the computational complexity of a biological neuron used neural networks to predict in-vitro data on the signal activity of a pyramidal neuron (the most common kind in the human brain) and found that it took a neural network with about 1000 computational “neurons” and hundreds of thousands of parameters, trained on a modern GPU for several days, to replicate its function.” If you want to use a neural network to do the same thing as a cortical neuron, then one way to do it is, following Beniaguev et al., to run a 7-layer, width-128 temporal convolutional network with 150 ms memory every millisecond. A central estimate of 1e32 FLOP to get the equivalent of 30 years of learning (1e9 seconds) with 1e15 synapses does seem reasonable from there. (With 4 inputs/filters, 1e15 × 1e9 × 1e3 × (7×128×150×4) ≈ 5e32, if I haven’t confused myself.)
That does imply the estimate is an upper bound on computational costs to emulate a neuron with an artificial neural network, although the authors argue that it’s likely fairly tight. It also implies the brain is doing its job much more efficiently than we know how to use an A100 to do it, but I’m not sure why that should be particularly surprising. It’s also true that for some tasks we already know how to do much better than emulating a brain.
Recapitulating the response of Steven Byrnes to this argument: it may be very expensive computationally to simulate a computer in a faithful way, but that doesn’t mean it’s expensive to do the same computation that the computer in question is doing. Paraphrasing a nice quote from Richard Borcherds, it may be that teapots are very hard to simulate on a classical computer, but that doesn’t mean that they are useful computational devices.
If we tried to simulate a GPU doing a simple matrix multiplication at high physical fidelity, we would have to take so many factors into account that the cost of our simulation would far exceed the cost of running the GPU itself. Similarly, if we tried to program a physically realistic simulation of the human brain, I have no doubt that the computational cost of doing so would be enormous.
However, this is not what we’re interested in doing. We’re interested in creating a computer that’s doing the same kind of computation as the brain, and the amount of useful computation that the brain could be doing per second is much less than 1e25 or even 1e20 FLOP/s. If your point is that 1e25 FLOP/s is an upper bound on how much computation the brain is doing, I agree, but there’s no reason to think it’s a tight upper bound.
This claim is different from the claim that the brain is doing 1e20 FLOP/s of useful computation, which is the claim that the authors actually make. If you have an object that implements some efficient algorithm that you don’t understand, the object can be doing little useful computation even though you would need much greater amounts of computation to match its performance with a worse algorithm. The estimates coming from the brain are important because they give us a sense of how much software efficiency progress ought to be possible here.
My argument from the Landauer limit is about the number of bit erasures and doesn’t depend on the software being implemented by the brain vs. a GPU. If the brain is doing something that’s in some sense equivalent to 1e20 floating point operations per second, based on its power consumption that would imply that it’s operating basically at the Landauer limit, perhaps only one order of magnitude off. Just the huge amount of noise in brain interconnect should be enough to discredit this estimate. Whether the brain is specialized to perform one or another kind of task is not relevant for this calculation.
Perhaps you think the brain has massive architectural or algorithmic advantages over contemporary neural networks, but if you do, that is a position that has to be defended on very different grounds than “it would take X amount of FLOP/s to simulate one neuron at a high physical fidelity”.
The Beniaguev paper does not attempt to simulate neurons at high physical fidelity. It merely attempts to simulate their outputs, which is a far simpler task. I am in total agreement with you that the computation needed to simulate a system is entirely distinct from the computation being performed by that system. Simulating a human brain would require vastly more than 1e21 FLOPS.
Is it? I suppose they don’t say so explicitly, but it sounds like they’re using “2020-equivalent” FLOPs (or whatever it is Cotra and Carlsmith use), which has room for “algorithmic progress” baked in.
I may be reading the essay wrong, but I think this is the claim being made and defended. “Simulating” a neuron at any level of physical detail is going to be irrelevantly difficult, and indeed in Beniaguev et al., running a DNN on a GPU that implements the computation a neuron is doing (four binary inputs, one output) is a 2000X speedup over solving PDEs (a combination of compression and hardware/software). They find it difficult to make the neural network smaller or shorter-memory, suggesting it’s hard to implement the same computation more efficiently with current methods.
I think you’re just reading the essay wrong. In the “executive summary” section, they explicitly state that
and
I don’t know how you read those claims and arrived at your interpretation, and indeed I don’t know how the evidence they provide could support the interpretation you’re talking about. It would also be a strange omission to not mention the “effective” part of “effective FLOP” explicitly if that’s actually what you’re talking about.
Thanks, I see. I agree that a lot of confusion could be avoided with clearer language, but I think at least that they’re not making as simple an error as you describe in the root comment. Ted does say in the EA Forum thread that they don’t believe brains operate at the Landauer limit, but I’ll let him chime in here if he likes.
I think the “effective FLOP” concept is very muddy, but I’m even less sure what it would mean to alternatively describe what the brain is doing in “absolute” FLOPs. Meanwhile, the model they’re using gives a relatively well-defined equivalence between the logical function of the neuron and modern methods on a modern GPU.
The statement about cost and energy efficiency as they elaborate in the essay body is about getting human-equivalent task performance relative to paying a human worker $25/hour, not saying that the brain uses five orders of magnitude less energy per FLOP of any kind. Closing that gap of five orders of magnitude could come either from doing less computation than the logical-equivalent-neural-network or from decreasing the cost of computation.