The first step in reducing confusion is to look at what a synaptic spike does. In terms of computational work, it is the equivalent of an ANN ‘synaptic op’: a memory read of a weight, a low-precision MAC (multiply-accumulate), and a weight memory write (via various neurotransmitter plasticity mechanisms). Some synapses probably do more than this (nonlinear decoding of spike timings, for example), but that’s a start. All of this is implemented in a device of roughly minimal size. The memory read/write is local, but the synapse also needs to act as an amplifier to some extent, to reduce noise and push the signal farther down the wire. An analog multiplier uses many charge carriers to get a reasonable SNR, which is comparable to the total charge carriers used across a digital multiplier, interconnect included.
So with that background you can apply the Landauer analysis to get a base per-bit energy, then estimate the analog MAC energy cost (or the equivalent digital MAC, though the digital MAC is much larger, so there are size/energy/speed tradeoffs), and finally consider the probably dominant interconnect cost. I estimate the interconnect cost alone at perhaps a watt.
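As a rough illustration of the first step (standard physical constants only; the ~1 eV per reliable bit figure is the order-of-magnitude assumption used later in this thread, not a measured value):

```python
import math

# Landauer bound at brain temperature: the minimum energy to erase one bit.
k_B = 8.617e-5                       # Boltzmann constant in eV/K
T = 310.0                            # approximate body temperature in K
landauer_bit = k_B * T * math.log(2)
print(f"Landauer limit at {T:.0f} K: {landauer_bit:.3f} eV per bit erased")
# -> ~0.019 eV. A *reliably* distinguishable switching event needs far more
#    (tens of kT, on the order of ~1 eV), which is the unit the later
#    ~1e5-units-per-synaptic-op estimate is counted in.
```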
A complementary approach is to compare against the projected end-of-CMOS-scaling technology used in research accelerator designs and see that you end up with similar numbers (also discussed in the article).
The brain, like current CMOS tech, is completely irreversible. Reversible computation is possible in theory, but it is exotic (like quantum computation, requiring near-zero temperatures) and may not be practical at scale in a noisy environment like Earth, for the reasons outlined by Cavin/Zhirnov here and discussed in a theoretical cellular model by Tiata here: basically, fully reversible computers rapidly forget everything as noise accumulates. Irreversible computers like brains and GPUs erase all thermal noise at every step, and pay the hot iron price to do so.
This does not explain how thousands of neurotransmitter molecules impinging on a neuron, and thousands of ions flooding into and out of cell membranes (all irreversible operations), just to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).
See my reply here, which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight and momentum/gradient etc. at minimal viable precision, multiplying for the forward, backward, and weight-update passes, and so on), then the answer is a fairly straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single-bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but it is easy to demolish any notion that it is merely doing something so simple.[1]
[1] There are of course many types of synapses which perform somewhat different computations and thus have different configurations, sizes, energy costs, etc. I am mostly referring to the energy/compute-dominant cortical pyramidal synapses.
Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.
Is your point that the amount of neurotransmitter is precisely meaningful (so that spending some energy/heat on pumping one additional ion is doing on the order of a bit of “meaningful work”)?
I’m not sure what you mean precisely by “precisely meaningful”, but I do believe we actually know enough about how neural circuits and synapses work[1] to have some confidence that they must be doing something similar to their artificial analogs in DL systems.
So this minimally requires:
storage for a K-bit connection weight in memory
(some synapses) nonlinear decoding of B-bit incoming neural spike signal (timing based)
analog ‘multiplication’[2] of incoming B-bit neural signal by K-bit weight
weight update from a local backpropagating Hebbian/gradient signal or equivalent
We know from DL that K and B do not need to be very large, but the optimal values are well above 1 bit, and more importantly the long-term weight storage (the equivalent of the gradient EMA/momentum) drives most of the precision demand, as it needs to accumulate many noisy measurements over time. From DL it looks like you want around 8 bits at least for long-term weight parameter storage, even if you can sample down to 4 bits or a bit lower for the forward/backward passes.
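A minimal sketch of the kind of operation being attributed to a synapse here, in DL terms (the class name, bit widths, and EMA update rule below are illustrative assumptions, not a model of the biology):

```python
import numpy as np

def quantize(x, bits):
    """Round x onto a signed fixed-point grid with the given bit width."""
    levels = 2 ** (bits - 1) - 1
    return float(np.clip(np.round(x * levels), -levels, levels)) / levels

class Synapse:
    """Toy synaptic op: K-bit stored weight, B-bit input, slow EMA-style update."""
    def __init__(self, w, w_bits=8, in_bits=4):
        self.w, self.w_bits, self.in_bits = w, w_bits, in_bits
        self.grad_ema = 0.0

    def forward(self, signal):
        # Low-precision 'multiply' of the incoming signal by the stored weight.
        return quantize(signal, self.in_bits) * quantize(self.w, self.w_bits)

    def update(self, local_error, lr=0.01, ema=0.9):
        # Accumulate many noisy local error signals into the stored weight;
        # this accumulation is what drives the ~8-bit storage requirement.
        self.grad_ema = ema * self.grad_ema + (1 - ema) * local_error
        self.w -= lr * self.grad_ema

syn = Synapse(0.37)
print(syn.forward(0.8))   # one forward 'synaptic op' (quantized multiply)
syn.update(0.1)           # one plasticity / weight-update event
```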
So that just takes a certain amount of work, and if you map out the minimal digital circuits in a maximally efficient hypothetical single-electron tile technology you really do get something on the order of 1e5 minimal ~1 eV units or more[3]. Synapses are also efficient in the sense that they grow/shrink to physically represent larger/smaller logical weights, using more or fewer resources in the optimal fashion.
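As a back-of-the-envelope check (using the ~1e5 × ~1 eV figure above together with the ~10^14 to 10^15 synaptic ops/s rate quoted below):

```python
EV_TO_J = 1.602e-19                      # joules per electron-volt
energy_per_synop = 1e5 * 1.0 * EV_TO_J   # ~1e5 minimal ~1 eV events per synaptic op
for rate in (1e14, 1e15):                # synaptic ops per second, per the quoted range
    print(f"{rate:.0e} ops/s -> {energy_per_synop * rate:.1f} W")
# -> roughly 1.6 W to 16 W, the same order as the brain's actual power budget
```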
I have also argued on the other side of this: there are some DL researchers who think the brain does many, many OOM more computation than it would seem, but we can rule that out with the same analysis.
[1] To those with the relevant background knowledge in DL, accelerator designs, and the relevant neuroscience.
[2] The actual synaptic operations are nonlinear and more complex, but they do something like the equivalent work of analog multiplication, and cannot be doing dramatically more or less.
[3] This is not easy to do either, and requires knowledge of the limits of electronics.
Thanks! (I’m having a hard time following your argument as a whole, and I’m also not trying very hard / being lazy / not checking the numbers; but I appreciate your answers, and they’re at least fleshing out some kind of model that feels useful to me.)
From the synapses section:
Thus the brain is likely doing on order 10^14 to 10^15 low-medium precision multiply-adds per second.
I don’t understand why this follows, and suspect it is false. Most of these synaptic operations are probably not “correct” multiply-adds of any precision—they’re actually more random, noisier functions that are approximated or modeled by analog MACs with particular input ranges.
And even if each synaptic operation really is doing the equivalent of an arbitrary analog MAC computation, that doesn’t mean that these operations are working together to do any kind of larger computation or cognition in anywhere close to the most efficient possible way.
Similar to how you can prune and distill large artificial models without changing their behavior much, I expect you could get rid of many neurons in the brain without changing the actual computation that it performs much or at all.
It seems like you’re modeling the brain as performing some particular exact computation where every bit of noise is counted as useful work. The fact that the brain may be within a couple OOM of fundamental thermodynamic limits of computing exactly what it happens to compute seems not very meaningful as a measure of the fundamental limit of useful computation or cognition possible given particular size and energy specifications.
I made a post which may help explain the analogy between spikes and multiply-accumulate operations.
I could make the exact same argument about some grad student’s first DL experiment running on a GPU, on multiple levels.
I also suspect you could get rid of many neurons in their DL model without changing the computation, and I suspect those neurons aren’t working together to do any kind of larger cognition in anywhere close to the most efficient possible way.
It’s also likely they may not even know how to use the tensor cores efficiently, and even if they did, the tensor cores waste most of their compute multiplying by zeros or near-zeros, regardless of how skilled/knowledgeable the DL practitioner is.
And yet, knowing all this, we still count flops in the obvious way, as “hypothetical fully utilized flops” is not an easy or useful quantity to measure, discuss, and compare.
Utilization of the compute resources is a higher level software/architecture efficiency consideration, not a hardware efficiency measure.
And yet, knowing all this, we still count flops in the obvious way, as “hypothetical fully utilized flops” is not an easy or useful quantity to measure, discuss, and compare.
Given a CPU capable of a specified number of FLOPs at a specified precision, I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].
Not so for brains, for at least a couple of reasons:
An individual neuron can’t necessarily perform an arbitrary multiply / add / accumulate operation, at any particular precision. It may be modeled by an analog MAC of a specified precision over some input range.
The software / architecture point above. For many artificial computations we care about, we can apply both micro (e.g. assembly-code optimization) and macro (e.g. using a sub-cubic algorithm for matrix multiplication) optimizations to get pretty close to the theoretical limit of efficiency. Maybe the brain is already doing the analog version of these kinds of optimizations in some cases. Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors, it’s another reason why the FLOPs-to-SYNOPs comparison is not apples-to-apples.
[1] Modulo some concerns about I/O, generation, checking, and CPU manufacturers inflating their benchmark numbers.
I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].
And? DL systems just use those floats to simulate large NNs, and a good chunk of recent progress has resulted from moving down to lower precision from 32b to 16b to 8b and soon 4b or lower, chasing after the brain’s carefully tuned use of highly energy efficient low precision ops.
Intelligence requires exploring a circuit space, simulating circuits. The brain is exactly the kind of hardware you need to do that with extreme efficiency given various practical physical constraints.
GPUs/accelerators can match the brain in the raw low-precision op/s useful for simulating NNs (circuits), but they use far more energy to do so and, more importantly, are also extremely limited by memory bandwidth, which results in an extremely poor 100:1 or even 1000:1 ALU:mem ratio; this prevents them from accelerating anything other than matrix-matrix multiplication, as opposed to the far more useful sparse vector-matrix multiplication.
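A rough sketch of why that ALU:mem ratio bites (the 4-byte element size and the 100+ FLOPs/byte threshold below are illustrative assumptions, not specs of any particular accelerator):

```python
# Arithmetic intensity (FLOPs per byte moved) determines whether an op is
# compute-bound or memory-bound on hardware with a given ALU:mem ratio.
BYTES = 4  # assume fp32-sized elements

def matmul_intensity(n):
    # dense n x n matrix-matrix multiply: ~2*n^3 FLOPs over ~3*n^2 elements moved
    return 2 * n**3 / (3 * n**2 * BYTES)

def matvec_intensity(n):
    # dense matrix-vector multiply: ~2*n^2 FLOPs over ~n^2 elements moved
    return 2 * n**2 / (n**2 * BYTES)

n = 4096
print(f"matmul: {matmul_intensity(n):7.1f} FLOPs/byte")   # ~683
print(f"matvec: {matvec_intensity(n):7.1f} FLOPs/byte")   # ~0.5
# Hardware that needs ~100+ FLOPs per byte to stay busy runs matmul near peak
# but sits mostly idle on (sparse) vector-matrix work, as argued above.
```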
Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors,
This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.
This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.
Give me some floats, and I can make a GPU do matrix multiplication, or sparse matrix multiplication, or many other kinds of computations, at a variety of precisions, across the entire domain of floats at that precision.
A brain is (maybe) carrying out a computation which is modeled by a particular bunch of sparse matrix multiplications, in which the programmer has much less control over the inputs, domain, and structure of the computation.
The fact that some process (maybe) irreducibly requires some number of FLOPs to simulate faithfully is different from that process being isomorphic to that computation itself.
Intelligence requires exploring and simulating a large circuit space, i.e. by using something like gradient descent on neural networks. You can use a GPU to do that inefficiently, or you can create custom nanotech analog hardware like the brain.
The brain emulates circuits, and current AI systems on GPUs simulate circuits inspired by the brain’s emulation.
Intelligence requires exploring and simulating a large circuit space, i.e. by using something like gradient descent on neural networks.
I don’t think neuroplasticity is equivalent to architecting and then doing gradient descent on an artificial neural network. That process is more analogous to billions of years of evolution, which encoded most of the “circuit exploration” process in DNA. In the brain, some of the weights and even connections are adjusted at “runtime”, but the rules for making those connections are necessarily encoded in DNA.
(Also, I flatly don’t buy that any of this is required for intelligence.)