First, I’m confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we’ve done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOP/s but with a much smaller rate of bit erasures.
As a minor nitpick, if A and B are 8-bit floating point numbers then the multiplication map x → B*x is almost never injective. This means even in your idealized setup, the operation (A, B) → (A, B*A) is going to lose some information, though I agree that this information loss will be << 8 bits, probably more like 1 bit amortized or so.
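(For anyone who wants to sanity-check that amortized figure, here is a minimal simulation sketch. The 1-4-3 minifloat decoder below is an illustrative toy format, not the exact OCP e4m3 spec, and the round-to-nearest rule is an assumption.)

```python
import math
from bisect import bisect_left
from collections import Counter

def decode_minifloat(byte):
    """Decode a byte as a toy 1-4-3 minifloat (sign, exponent, mantissa).
    Illustrative only; not exactly the FP8 e4m3 spec (no NaN handling)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x07
    if exp == 0:                                   # subnormal numbers
        return sign * (man / 8.0) * 2.0 ** (-6)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

values = sorted({decode_minifloat(b) for b in range(256)})
n = len(values)

def round_to_representable(x):
    """Round a real product back to the nearest representable value."""
    i = bisect_left(values, x)
    candidates = values[max(0, i - 1): i + 1]
    return min(candidates, key=lambda v: abs(v - x))

# For each fixed multiplier B, the map x -> round(B*x) merges inputs, so
# for uniform inputs the bits erased are H(X) - H(round(B*X)).
losses = []
for B in values:
    outputs = Counter(round_to_representable(B * x) for x in values)
    H_out = -sum((c / n) * math.log2(c / n) for c in outputs.values())
    losses.append(math.log2(n) - H_out)

print(f"mean bits erased per 8-bit multiply: {sum(losses) / len(losses):.2f}")
```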
The bigger problem is that logical reversibility doesn’t imply physical reversibility. I can think of ways in which we could set up sophisticated classical computation devices which are logically reversible, and perhaps could be made approximately physically reversible when operating in a near-adiabatic regime at low frequencies, but the brain is not operating in this regime (especially if it’s performing 1e20 FLOP/s). At high frequencies, I just don’t see which architecture you have in mind to perform lots of 8-bit floating point multiplications without raising the entropy of the environment by on the order of 8 bits.
Again using your setup, if you actually tried to implement (A, B) → (A, A*B) on a physical device, you would need to take the register that is storing B and replace the stored value with A*B instead. To store 1 bit of information you need a potential energy barrier that’s at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you’ll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn’t have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
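For concreteness, here is the arithmetic behind that floor as a minimal sketch (the 1e20 FLOP/s figure and the ~8 switched barriers per multiply are taken from the discussion above):

```python
import math

k_B = 1.380649e-23                    # Boltzmann constant, J/K
T = 310.0                             # body temperature, K
bit_energy = k_B * T * math.log(2)    # Landauer cost per switched barrier, ~3e-21 J

flops = 1e20                          # hypothesized brain FLOP/s (from the discussion above)
barriers_per_op = 8                   # ~8 barrier switches per 8-bit multiply

power_floor = flops * barriers_per_op * bit_energy
print(f"idealized dissipation floor: {power_floor:.1f} W")   # ~2.4 W
```

Note that the idealized floor by itself comes out to only a few watts at 1e20 FLOP/s; the point of the paragraph above is that realistic, non-adiabatic devices dissipate far above this floor per switching event.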
Ajeya Cotra estimates training could take anything from 1e24 to 1e54 floating point operations, or even more. Her narrower lifetime anchor ranges from 1e24 to 1e38ish.
Two points about that:
This is a measure that takes into account the uncertainty over how much less efficient our software is compared to the human brain. I agree that human lifetime learning compute being around 1e25 FLOP is not strong evidence that the first TAI system we train will use 1e25 FLOP of compute; I expect it to take significantly more than that.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we’re already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
Do you think Cotra’s estimates are not just poor, but crazy as well?
No, because Cotra doesn’t claim that the human brain performs 1e25 FLOP/s—her claim is quite different.
The claim that “the first AI system to match the performance of the human brain might require 1e25 FLOP/s to run” is not necessarily crazy, though it needs to be supported by evidence of the relative inefficiency of our algorithms compared to the human brain, and by estimates of how much software progress we should expect in the future.
Thanks, that’s clarifying. (And yes, I’m well aware that x → B*x is almost never injective, which is why I said it wouldn’t cause 8 bits of erasure rather than the stronger, incorrect claim of 0 bits of erasure.)
To store 1 bit of information you need a potential energy barrier that’s at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you’ll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn’t have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
Two more points of confusion:
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
To what extent do information storage requirements weigh on FLOP/s requirements? It’s not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short.
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
Where else is the energy going to go? Again, in an adiabatic device where you have a lot of time to discharge capacitors and such, you might be able to do everything in a way that conserves free energy. I just don’t see how that’s going to work when you’re (for example) switching transistors on and off at a high frequency. It seems to me that the only place to get rid of the electrical potential energy that quickly is to convert it into heat or radiation.
I think what I’m saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post. If you have a proposal for how you think the brain could actually be working to be much more energy efficient than this, I would like to see some details of it, because I’ve certainly not come across anything like that before.
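To make that standard analysis concrete, here is a rough sketch with assumed order-of-magnitude device parameters (the capacitance and supply voltage below are illustrative round numbers, not measurements of any particular process node):

```python
import math

k_B, T = 1.380649e-23, 300.0
landauer = k_B * T * math.log(2)      # ~2.9e-21 J per bit at room temperature

# Assumed, order-of-magnitude values for a modern logic transistor:
C = 1e-16                             # switched capacitance, ~0.1 fF (illustrative)
V = 0.7                               # supply voltage, V (illustrative)

E_cycle = C * V ** 2                  # energy dissipated per full charge/discharge cycle
print(f"CV^2 per switching cycle: {E_cycle:.1e} J")
print(f"Landauer limit:           {landauer:.1e} J")
print(f"ratio: ~{E_cycle / landauer:.0f}x above the Landauer limit")
```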
To what extent do information storage requirements weigh on FLOP/s requirements? It’s not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short.
The Boltzmann factor roughly gives you the steady-state distribution of the associated two-state Markov chain, so if time delays are short it’s possible this would be irrelevant. However, I think that in realistic devices the Markov chain reaches equilibrium far too quickly for you to get around the thermodynamic argument by appealing to the system being out of equilibrium.
My reasoning here is that the Boltzmann factor also gives you the odds of an electron having enough kinetic energy to cross the potential barrier upon colliding with it, so e.g. if you imagine an electron stuck in a potential well that’s O(k_B T) deep, the electron will only need to collide with one of the barriers O(1) times to escape. So the convergence time to equilibrium comes down to the length of the well divided by the thermal speed of the electron, which is going to be quite short, as electrons at the Fermi level in a typical wire move at speeds comparable to 1000 km/s.
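A back-of-the-envelope version of that argument (the 10 nm well width is an assumed, illustrative number):

```python
import math

L_well = 10e-9             # well width, m (assumed for illustration)
v = 1e6                    # electron speed near the Fermi level, ~1000 km/s
attempt_rate = v / L_well  # barrier collisions per second, ~1e14 /s

# Escape probability per collision is ~exp(-E_barrier / k_B T), so the
# expected time to escape (i.e. to equilibrate) is the reciprocal of
# attempt_rate * exp(-E_barrier / k_B T):
for barrier_in_kT in (1, 5, 10, 20):
    t_escape = math.exp(barrier_in_kT) / attempt_rate
    print(f"barrier of {barrier_in_kT:>2} k_B T: escape time ~{t_escape:.0e} s")
```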
I can try to calculate exactly what you should expect the convergence time here to be for some configuration you have in mind, but I’m reasonably confident that when the energies involved are comparable to the Landauer bit energy, this convergence happens quite rapidly for any kind of realistic device.
Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don’t see how that follows at all.
Where else is the energy going to go?
What is “the energy” that has to go somewhere? As you recognize, there’s nothing that says it costs energy to change the shape of a potential well. I’m genuinely not sure what energy you’re talking about here. Is it electrical potential energy spent polarizing a medium?
I think what I’m saying is standard in how people analyze power costs of switching in transistors, see e.g. this physics.se post.
Yeah, that’s pretty standard. The ultimate efficiency limit for a semiconductor field-effect transistor is bounded by the 60 mV/dec subthreshold swing, and modern tiny transistors have to deal with all sorts of problems like leakage current which make it difficult to even reach that limit.
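The 60 mV/dec figure follows directly from the Boltzmann factor, as a quick check shows:

```python
import math

k_B = 1.380649e-23        # Boltzmann constant, J/K
q = 1.602176634e-19       # elementary charge, C
T = 300.0                 # room temperature, K

# Subthreshold current scales as exp(q*V_gate / k_B T), so one decade of
# current requires a gate-voltage change of (k_B T / q) * ln(10):
S = (k_B * T / q) * math.log(10) * 1e3   # in mV per decade
print(f"thermionic limit on subthreshold swing: {S:.1f} mV/dec")  # ~59.5
```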
Unclear to me that semiconductor field-effect transistors have anything to do with neurons, but I don’t know how neurons work, so my confusion is more likely a state of my mind than a state of the world.
I don’t think transistors have too much to do with neurons beyond the abstract observation that neurons most likely store information by establishing gradients of potential energy. When the stored information needs to be updated, that means some gradients have to get moved around, and if I had to imagine how this works inside a cell it would probably involve some kind of proton pump operating across a membrane or something like that. That’s going to be functionally pretty similar to a capacitor, and discharging & recharging it probably carries similar free energy costs.
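As a rough plausibility check on “similar free energy costs”, one can compare the ½CV² energy of recharging a small patch of membrane against the transistor switching energy estimated earlier. The specific capacitance below is the textbook order of magnitude for a lipid bilayer; the patch size is an arbitrary illustrative choice:

```python
C_membrane = 1e-6 / 1e8    # specific capacitance: ~1 uF/cm^2, converted to F/um^2
V_swing = 0.1              # ~100 mV potential swing across the membrane
area = 1.0                 # an (arbitrary) 1 um^2 patch of membrane

E_patch = 0.5 * C_membrane * area * V_swing ** 2
print(f"energy to recharge 1 um^2 of membrane: {E_patch:.1e} J")
# ~5e-17 J, i.e. the same order as the CV^2 switching energy estimated earlier.
```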
I think what I don’t understand is why you’re defaulting to the assumption that the brain has a way to store and update information that’s much more efficient than what we’re able to do. That doesn’t sound like a state of ignorance to me; it seems like you wouldn’t hold this belief if you didn’t think there was a good reason to do so.
I think what I don’t understand is why you’re defaulting to the assumption that the brain has a way to store and update information that’s much more efficient than what we’re able to do. That doesn’t sound like a state of ignorance to me; it seems like you wouldn’t hold this belief if you didn’t think there was a good reason to do so.
It’s my assumption because our brains are AGI for ~20 W.
In contrast, many kW of GPUs are not AGI.
Therefore, it seems like brains have a way of storing and updating information that’s much more efficient than what we’re able to do.
Of course, maybe I’m wrong and it’s due to a lack of training or lack of data or lack of algorithms, rather than lack of hardware.
DNA storage is way more information dense than hard drives, for example.
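(That density gap can be checked with rough geometry; the numbers below are order-of-magnitude assumptions, not specs of any particular drive or DNA storage scheme.)

```python
# DNA: ~2 bits per base pair; a base pair occupies roughly 1 nm^3
# (2 nm helix diameter, ~0.34 nm rise per base pair).
dna_bits_per_cm3 = 2 / 1e-21         # 1 nm^3 = 1e-21 cm^3 -> ~2e21 bits/cm^3

# Hard drive: a ~20 TB 3.5" drive in a ~400 cm^3 enclosure (rough).
hdd_bits_per_cm3 = 20e12 * 8 / 400   # ~4e11 bits/cm^3

print(f"DNA / HDD volumetric density ratio: ~{dna_bits_per_cm3 / hdd_bits_per_cm3:.0e}")
```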
It’s my assumption because our brains are AGI for ~20 W.
I think that’s probably the crux. I think the evidence that the brain is not performing that much computation is reasonably good, so I attribute the difference to algorithmic advantages the brain has, particularly ones that make the brain more data efficient relative to today’s neural networks.
I think the brain being more data efficient is hard to dispute, but of course you can argue that this is simply because the brain is doing a lot more computation internally to process the limited amount of data it does see. I’m more ready to believe that the brain has some software advantage over neural networks than to believe that it has an enormous hardware advantage.
Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we’re already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.
Isn’t it a noted weakness of Cotra’s approach that most of the anchors don’t actually depend on 2020 architecture or algorithmic performance in any concrete way? As in, if the same method were applied today, it would produce the same numbers in “2023 FLOP”? This is related to why I think the Beniaguev paper is pretty relevant exactly as evidence of “inefficiency of our algorithms compared to the human brain”.