I’m confused at how somebody ends up calculating that a brain—where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually—could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed “Brain Efficiency” though not checked any numbers, and not seen anything inside it which seems to address this sanity check.
The first step in reducing confusion is to look at what a synaptic spike does. It is the equivalent of—in terms of computational power—an ANN ‘synaptic spike’, which is a memory read of a weight, a low precision MAC (multiply accumulate), and a weight memory write (various neurotransmitter plasticity mechanisms). Some synapses probably do more than this—nonlinear decoding of spike times for example, but that’s a start. This is all implemented in a pretty minimal-size-looking device. The memory read/write is local, but it also needs to act as an amplifier to some extent, to reduce noise and push the signal farther down the wire. An analog multiplier uses many charge carriers to get a reasonable SNR, which compares to all the charge carriers across a digital multiplier including interconnect.
So with that background you can apply the Landauer analysis to get base bit energy, then estimate the analog MAC energy cost (or equivalent digital MAC, but the digital MAC is much larger so there are size/energy/speed tradeoffs), and finally consider the probably dominant interconnect cost. I estimate the interconnect cost alone at perhaps a watt.
A complementary approach is to compare to projected upcoming end-of-CMOS-scaling tech as used in research accelerator designs and see that you end up getting similar numbers (also discussed in the article).
The brain, like current CMOS tech, is completely irreversible. Reversible computation is possible in theory but is exotic like quantum computation, requiring near-zero temperature, and may not be practical at scale in a noisy environment like the earth, for the reasons outlined by Cavin/Zhirnov here and discussed in a theoretical cellular model by Tiata here—basically fully reversible computers rapidly forget everything as noise accumulates. Irreversible computers like brains and GPUs erase all thermal noise at every step, and pay the hot iron price to do so.
This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).
See my reply here which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight, momentum, gradient, etc. in minimal viable precision, multiplier for forward, backward, and weight update, etc), then the answer is a more straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but it’s easy to demolish any notion that it is merely doing something so simple.[1]
There are of course many types of synapses which perform somewhat different computations and thus have different configurations, sizes, energy costs, etc. I am mostly referring to the energy/compute-dominant cortical pyramidal synapses.
Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.
Is your point that the amount of neurotransmitter is precisely meaningful (so that spending some energy/heat on pumping one additional ion is doing on the order of a bit of “meaningful work”)?
I’m not sure what you mean precisely by “precisely meaningful”, but I do believe we actually know enough about how neural circuits and synapses work[1] such that we have some confidence that they must be doing something similar to their artificial analogs in DL systems.
So this minimally requires:
storage for a K-bit connection weight in memory
(some synapses) nonlinear decoding of B-bit incoming neural spike signal (timing based)
analog ‘multiplication’[2] of incoming B-bit neural signal by K-bit weight
weight update from local backpropagating hebbian/gradient signal or equivalent
We know from DL that K and B do not need to be very large, but the optimal values are well above 1-bit, and more importantly the long term weight storage (equivalent of gradient EMA/momentum) drives most of the precision demand, as it needs to accumulate many noisy measurements over time. From DL it looks like you want around 8-bit at least for long-term weight param storage, even if you can sample down to 4-bit or a bit lower for forward/backward passes.
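To make the accumulation point concrete, here is a toy sketch (illustrative only; the update rule, noise level, and learning rate below are made up, not taken from the post): small noisy-but-consistent updates survive when the stored weight has roughly 8-bit granularity, but round away entirely at roughly 4-bit granularity.

```python
import random
random.seed(0)

def train(stored_weight_bits: int, n_updates: int = 10_000, lr: float = 1e-3) -> float:
    q = 2.0 ** -stored_weight_bits          # quantization step of the stored weight
    w = 0.0
    for _ in range(n_updates):
        grad = 1.0 + random.gauss(0, 5)     # noisy gradient with a small consistent signal
        w = round((w + lr * grad) / q) * q  # the stored weight only ever lives on the coarse grid
    return w

print(train(stored_weight_bits=8))   # drifts clearly upward: the accumulated signal survives
print(train(stored_weight_bits=4))   # stays pinned at 0: every individual update rounds away
```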
So that just takes a certain amount of work, and if you map out the minimal digital circuits in a maximally efficient hypothetical single-electron tile technology you really do get something on order 1e5 minimal 1eV units or more[3]. Synapses are also efficient in the sense that they grow/shrink to physically represent larger/smaller logical weights using more/less resources in the optimal fashion.
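As a rough consistency check, combining the ~1e5 minimal ~1 eV units per op above with the ~1e14 to 1e15 synaptic ops/s estimate quoted later in this thread lands in the same range as the brain's ~10 W budget (a back-of-the-envelope sketch, not a derivation):

```python
EV_IN_JOULES = 1.602e-19

units_per_synop = 1e5                        # ~1e5 minimal ~1 eV units per synaptic op (from above)
joules_per_synop = units_per_synop * EV_IN_JOULES

for synops_per_second in (1e14, 1e15):       # the synaptic op rate estimated elsewhere in the thread
    watts = joules_per_synop * synops_per_second
    print(f"{synops_per_second:.0e} synop/s -> ~{watts:.0f} W")
# ~2 W to ~16 W: the same order as the brain's ~10 W power budget.
```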
I have also argued on the other side of this—there are some DL researchers who think the brain does many many OOM more computation than it would seem, but we can rule that out with the same analysis.
[1] To those with the relevant background knowledge in DL, accelerator designs, and the relevant neuroscience.
[2] The actual synaptic operations are non-linear and more complex, but do something like the equivalent work of analog multiplication, and can’t be doing dramatically more or less.
[3] This is not easy to do either and requires knowledge of the limits of electronics.
Thanks! (I’m having a hard time following your argument as a whole, and I’m also not trying very hard / being lazy / not checking the numbers; but I appreciate your answers, and they’re at least fleshing out some kind of model that feels useful to me. )
From the synapses section:
Thus the brain is likely doing on order 10^14 to 10^15 low-medium precision multiply-adds per second.
I don’t understand why this follows, and suspect it is false. Most of these synaptic operations are probably not “correct” multiply-adds of any precision—they’re actually more random, noisier functions that are approximated or modeled by analog MACs with particular input ranges.
And even if each synaptic operation really is doing the equivalent of an arbitrary analog MAC computation, that doesn’t mean that these operations are working together to do any kind of larger computation or cognition in anywhere close to the most efficient possible way.
Similar to how you can prune and distill large artificial models without changing their behavior much, I expect you could get rid of many neurons in the brain without changing the actual computation that it performs much or at all.
It seems like you’re modeling the brain as performing some particular exact computation where every bit of noise is counted as useful work. The fact that the brain may be within a couple OOM of fundamental thermodynamic limits of computing exactly what it happens to compute seems not very meaningful as a measure of the fundamental limit of useful computation or cognition possible given particular size and energy specifications.
I could make the exact same argument about some grad student’s first DL experiment running on a GPU, on multiple levels.
I also suspect you could get rid of many neurons in their DL model without changing the computation, I suspect they aren’t working together to do any kind of larger cognition in anywhere closer to the most efficient possible way.
It’s also likely they may not even know how to use the tensorcores efficiently, and even if they did, the tensorcores waste most of their compute multiplying by zeros or near zeroes, regardless of how skilled/knowledgeable the DL practitioner.
And yet knowing all this, we still count flops in the obvious way, as “hypothetical fully utilized flops” is not an easy or useful quantity to measure, discuss, and compare.
Utilization of the compute resources is a higher level software/architecture efficiency consideration, not a hardware efficiency measure.
And yet knowing all this, we still count flops in the obvious way, as “hypothetical fully utilized flops” is not an easy or useful quantity to measure, discuss, and compare.
Given a CPU capable of a specified number of FLOPs at a specified precision, I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].
Not so for brains, for at least a couple of reasons:
An individual neuron can’t necessarily perform an arbitrary multiply / add / accumulate operation, at any particular precision. It may be modeled by an analog MAC of a specified precision over some input range.
The software / architecture point above. For many artificial computations we care about, we can apply both micro (e.g. assembly code optimization) and macro (e.g. using a non-quadratic algorithm for matrix multiplication) optimization to get pretty close to the theoretical limit of efficiency. Maybe the brain is already doing the analog version of these kinds of optimizations in some cases. Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors, it’s another reason why the FLOPs to SYNOPs comparison is not apples-to-apples.
[1] modulo some concerns about I/O, generation, checking, and CPU manufacturers inflating their benchmark numbers
I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].
And? DL systems just use those floats to simulate large NNs, and a good chunk of recent progress has resulted from moving down to lower precision from 32b to 16b to 8b and soon 4b or lower, chasing after the brain’s carefully tuned use of highly energy efficient low precision ops.
Intelligence requires exploring a circuit space, simulating circuits. The brain is exactly the kind of hardware you need to do that with extreme efficiency given various practical physical constraints.
GPUs/accelerators can match the brain in raw low precision op/s useful for simulating NNs (circuits), but use far more energy to do so and more importantly are also extremely limited by memory bandwidth, which results in an extremely poor 100:1 or even 1000:1 ALU:MEM ratio, which prevents them from accelerating anything other than matrix matrix multiplication, rather than the far more useful sparse vector matrix multiplication.
Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors,
This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.
This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.
Give me some floats, I can make a GPU do matrix multiplication, or sparse matrix multiplication, or many other kinds of computations at a variety of precisions across the entire domain of floats at that precision.
A brain is (maybe) carrying out a computation which is modeled by a particular bunch of sparse matrix multiplications, in which the programmer has much less control over the inputs, domain, and structure of the computation.
The fact that some process (maybe) irreducibly requires some number of FLOPs to simulate faithfully is different from that process being isomorphic to that computation itself.
Intelligence requires exploring and simulating a large circuit space—ie by using something like gradient descent on neural networks. You can use a GPU to do that inefficiently or you can create custom nanotech analog hardware like the brain.
The brain emulates circuits, and current AI systems on GPUs simulate circuits inspired by the brain’s emulation.
Intelligence requires exploring and simulating a large circuit space—ie by using something like gradient descent on neural networks.
I don’t think neuroplasticity is equivalent to architecting and then doing gradient descent on an artificial neural network. That process is more analogous to billions of years of evolution, which encoded most of the “circuit exploration” process in DNA. In the brain, some of the weights and even connections are adjusted at “runtime”, but the rules for making those connections are necessarily encoded in DNA.
(Also, I flatly don’t buy that any of this is required for intelligence.)
Further item of “these elaborate calculations seem to arrive at conclusions that can’t possibly be true”—besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of digital transistor operations about them, even if you otherwise use actual real-world physical hardware. Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency! https://spectrum.ieee.org/analog-ai
I’m not sure why you believe “the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible”. GPUs require at least on order ~1e-11 J to fetch a single 8-bit value from GDDRX RAM (1e-19 J/bit/nm (interconnect wire energy) * 1 cm * 8 bits), so around ~1 kW or 100x the brain for 1e14 of those per second, not even including flop energy cost (the brain doesn’t have much more efficient wires, it just minimizes that entire cost by moving the memory synapses/weights as close as possible to the compute .. by merging them). I do claim that Moore’s Law is ending and not delivering much further increase in CMOS energy efficiency (and essentially zero increase in wire energy efficiency), but GPUs are far from the optimal use of CMOS towards running NNs.
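The arithmetic in that fetch-energy estimate, spelled out with the same assumed numbers:

```python
wire_energy_j_per_bit_nm = 1e-19    # ~1e-19 J/bit/nm interconnect wire energy (as assumed above)
distance_nm = 1e7                   # ~1 cm from off-chip GDDR RAM to the compute die
bits_per_fetch = 8

j_per_fetch = wire_energy_j_per_bit_nm * distance_nm * bits_per_fetch
print(f"energy per 8-bit fetch ~ {j_per_fetch:.0e} J")      # ~8e-12 J, i.e. on order 1e-11 J

fetches_per_second = 1e14
print(f"power at 1e14 fetches/s ~ {j_per_fetch * fetches_per_second:.0f} W")   # ~800 W, order 1 kW
```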
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate,
That sounds about right, and indeed I roughly estimate the minimal energy for 8-bit analog MAC at the end of the synapse section, with 4 example refs from the research lit:
We can also compare the minimal energy prediction of 10^-15 J/op for 8-bit equivalent analog multiply-add to the known and predicted values for upcoming efficient analog accelerators, which mostly have energy efficiency in the 10^-14 J/op range[1][2][3][4] for < 8 bit, with the higher reported values around 10^-15 J/op similar to the brain estimate here, but only for < 4-bit precision[5]. Analog devices can not be shrunk down to few nm sizes without sacrificing SNR and precision; their minimal size is determined by the need for a large number of carriers, on order 2^(c·β) for equivalent bit precision β, and c ~ 2, as discussed earlier.
The more complicated part of comparing these is how/whether to include the cost of reading/writing a synapse/weight value from RAM across a long wire, which is required for full equivalence to the brain. The brain as a true RNN is doing Vector Matrix multiplication, whereas GPUs/Accelerators instead do Matrix Matrix multiplication to amortize the cost of expensive RAM fetches. VM mult can simulate MM mult at no extra cost, but MM mult can only simulate VM mult at huge inefficiency proportional to the minimal matrix size (determined by ALU/RAM ratio, ~1000:1 now at low precision). The full neuromorphic or PIM approach instead moves the RAM next to the processing elements, and is naturally more suited to VM mult.
[1] Bavandpour, Mohammad, et al. “Mixed-Signal Neuromorphic Processors: Quo Vadis?” 2019 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2019.
[2] Chen, Jia, et al. “Multiply accumulate operations in memristor crossbar arrays for analog computing.” Journal of Semiconductors 42.1 (2021): 013104.
[3] Li, Huihan, et al. “Memristive crossbar arrays for storage and computing applications.” Advanced Intelligent Systems 3.9 (2021): 2100017.
[4] Li, Can, et al. “Analogue signal and image processing with large memristor crossbars.” Nature Electronics 1.1 (2018): 52-59.
[5] Mahmoodi, M. Reza, and Dmitri Strukov. “Breaking POps/J barrier with analog multiplier circuits based on nonvolatile memories.” Proceedings of the International Symposium on Low Power Electronics and Design. 2018.
Okay, if you’re not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.
Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication.
What does this have to do with the RAM being next to the processing units?
(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world-expert in GPU-design!
E.g. PIM, CMOS, MAC etc.)
Matrix Matrix Mult of square matrices dim N uses ~2N^3 ALU ops and ~3N^2 MEM ops, so it has an arithmetic intensity of ~N (ALU:MEM ratio).
Vector Matrix Mult of dim N uses ~2N^2 ALU and ~3N^2 MEM, for an arithmetic intensity of ~1.
A GPU has an ALU:MEM ratio of about 1000:1 (for lower precision tensorcore ALU), so it is inefficient at vector matrix mult by a factor of about 1000 vs matrix matrix mult. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units in a tensorcore vs very long wire distances to reach a value in off chip RAM.
What is ALU and MEM exactly? And what is the significance of the ALU:MEM ratio?
The GPU needs numbers to be stored in registers inside the GPU before it can do operations on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you do an elementary arithmetic operation, such as addition or multiplication, on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor, so they are called ALU ops.
Because a matrix multiplication of two N×N matrices only involves 2N^2 distinct floating point numbers as input, and writing the result back into memory is going to cost you another N^2 memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N^2. In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with 2N·N^2 = 2N^3 ALU ops needed.
The ALU:MEM ratio is important because if your computation is imbalanced relative to what is supported by your hardware then you’ll end up being bottlenecked by one of them and you’ll be unable to exploit the surplus resources you have on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio, whenever you’re only using the hardware to do matrix multiplications you’ll have enormous amounts of MEM ops capacity sitting idle because you don’t have the capacity to be utilizing them.
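A small sketch of the bookkeeping above, using the same operation counts (the ~3N^2 MEM figure for the vector case is taken from the comment, not re-derived):

```python
# Arithmetic intensity (ALU ops per MEM op), using the operation counts given above.
def matmat_intensity(n: int) -> float:
    alu = 2 * n**3      # each of the n^2 outputs needs ~n multiplies and ~n adds
    mem = 3 * n**2      # read two n x n inputs, write one n x n output
    return alu / mem    # grows like ~n

def vecmat_intensity(n: int) -> float:
    alu = 2 * n**2      # each of the n outputs needs ~n multiplies and ~n adds
    mem = 3 * n**2      # ~3N^2 as counted above, dominated by reading the n x n matrix
    return alu / mem    # stays ~1 no matter how big n gets

for n in (1024, 4096):
    print(f"N={n}: mat-mat ~{matmat_intensity(n):.0f}, vec-mat ~{vecmat_intensity(n):.2f}")
# A GPU with a ~1000:1 ALU:MEM ratio is well matched to mat-mat work once N is ~1000+,
# but leaves almost all of its ALU capacity idle on vec-mat work.
```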
So true 8-bit equivalent analog multiplication requires about 100k carriers/switches
This just seems utterly wack. Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit? And “analog multiplication down to two decimal places” is the operation that is purportedly being carried out almost as efficiently as physically possible by… an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic terminal (molecules which will later need to be irreversibly pumped back out), which in turn depolarizes and starts flooding thousands of ions into a cell membrane (to be later pumped out) in order to transmit the impulse at 1m/s? That’s the most thermodynamically efficient a physical cognitive system can possibly be? This is approximately the most efficient possible way to turn all those bit erasures into thought?
This sounds like physical nonsense that fails a basic sanity check. What am I missing?
And “analog multiplication down to two decimal places” is the operation that is purportedly being carried out almost as efficiently as physically possible by
I am not certain it is being carried out “almost as efficiently as physically possible”, assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation); my belief is more that the brain and its synaptic elements are reasonably efficient in a pareto tradeoff sense.
But any discussion around efficiency must make some starting assumptions about what computations the system may be performing. We now have a reasonable amount of direct and indirect evidence—direct evidence from neuroscience, indirect evidence from DL—that allows us some confidence that the brain is conventional (irreversible, non quantum), and is basically very similar to an advanced low power DL accelerator built out of nanotech replicators. (and the clear obvious trend in hardware design is towards the brain)
So starting with that frame ..
Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit?
A synaptic op is the equivalent of reading an 8b-ish weight from memory, ‘multiplying’ by the incoming spike value, propagating the output down the wire, updating neurotransmitter receptors (which store not just the equivalent of the weight, but the bayesian distribution params on the weight, equivalent to gradient momentum etc), back-propagating spike (in some scenarios), spike decoding (for nonlinear spike timing codes), etc.
It just actually does a fair amount of work, and if you actually query the research literature to see how many transistors that would take, it is something like 10k to 100k or more, each of which minimally uses 1 eV per op * 10 for interconnect, according to the best micromodels of circuit limits (Cavin/Zhirnov).
The analog multiplier and gear is very efficient (especially in space) for low SNRs, but it scales poorly (exponentially) with bit precision (equivalent SNR). From the last papers I recall 8b is the crossover point where digital wins in energy and perhaps size. Below that analog dominates. There are numerous startups working on analog hardware to replace GPUs for low bit precision multipliers, chasing the brain, but it’s extremely difficult and IMHO may not be worth it without nanotech.
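For scale, here is the carrier-count scaling quoted earlier (assuming it means carriers ≈ 2^(c·β) with c ~ 2), which reproduces the ~100k figure at 8-bit equivalent precision:

```python
C = 2                                   # c ~ 2, as stated in the quoted passage
for beta in (2, 4, 6, 8):               # equivalent bit precision
    carriers = 2 ** (C * beta)
    print(f"{beta}-bit equivalent -> ~{carriers:,} carriers")
# 2 -> 16, 4 -> 256, 6 -> 4,096, 8 -> 65,536 (order 100k, matching the figure quoted earlier),
# which is why analog wins decisively at low precision and loses its edge around 8 bits.
```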
in order to transmit the impulse at 1m/s?
The brain only runs at 100 Hz and the axon conduction velocity is optimized just so that every brain region can connect to distal regions without significant delay (delay on order of a millisecond or so).
So the real question is then just why 100 Hz—which I also answer in Brain Efficiency. If you have a budget of 10W you can spend that running a very small NN very fast or a very large NN at lower speeds, and the latter seems more useful for biology. Digital minds obviously can spend the energy cost to run at fantastic speeds—and GPT-4 was only possible because its NN can run vaguely ~10000x faster than the brain (for training).
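A sketch of that fixed-power tradeoff, reusing the thread's rough ~1.6e-14 J per synaptic op; the per-synapse drive rates below are purely illustrative:

```python
POWER_W = 10.0                          # the ~10 W brain budget discussed in the thread
J_PER_SYNOP = 1e5 * 1.602e-19           # ~1.6e-14 J per synaptic op, reusing the earlier estimate

synop_budget = POWER_W / J_PER_SYNOP    # ~6e14 synaptic ops per second in total
for rate_hz in (0.5, 5.0, 50.0):        # illustrative average drive rate per synapse
    print(f"{rate_hz:>4} Hz per synapse -> room for ~{synop_budget / rate_hz:.1e} synapses")
# The same 10 W buys a huge slow network or a small fast one; biology went huge and slow.
```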
I’ll end with an interesting quote from Hinton[1]:
The separation of software from hardware is one of the foundations of Computer Science and it has many benefits. It makes it possible to study the properties of programs without worrying about electrical engineering. It makes it possible to write a program once and copy it to millions of computers. If, however, we are willing to abandon immortality it should be possible to achieve huge savings in the energy required to perform a computation and in the cost of fabricating the hardware that executes the computation. We can allow large and unknown variations in the connectivity and non-linearities of different instances of hardware that are intended to perform the same task and rely on a learning procedure to discover parameter values that make effective use of the unknown properties of each particular instance of the hardware. These parameter values are only useful for that specific hardware instance, so the computation they perform is mortal: it dies with the hardware.
I think the quoted claim is actually straightforwardly true? Or at least, it’s not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.
I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.
These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (maybe, if the rest of it is correct) it is not possible to simulate the operation of a human brain at synapse granularity, using much less than 10W and 1000 cm^3. Which is an interesting fact if true, but doesn’t seem to have much bearing on the question of whether the brain is close to an optimal substrate for carrying out the abstract computation of human cognition.
(I expanded a little on the point about modeling a computation vs. the computation itself in an earlier sibling reply.)
Or at least, it’s not really surprising that actual precise 8 bit analog multiplication
I’m not sure what you mean by “precise 8 bit analog multiplication”, as analog is not precise in the way digital is. When I say 8-bit analog equivalent, I am talking about an analog operation that has SNR equivalent to quantized 8-bit digital, which is near the maximum useful range for analog multiplier devices, and near the upper range of estimates of synaptic precision.
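For orientation, here is the standard ADC-style rule of thumb relating SNR to effective bits (a general reference point, not something from the post):

```python
# SNR_dB ~= 6.02 * bits + 1.76 for an ideal quantizer driven by a full-scale sinusoid.
def snr_db_for_bits(bits: float) -> float:
    return 6.02 * bits + 1.76

def effective_bits_for_snr_db(snr_db: float) -> float:
    return (snr_db - 1.76) / 6.02

print(f"8-bit-equivalent SNR ~ {snr_db_for_bits(8):.1f} dB")             # ~49.9 dB
print(f"50 dB SNR ~ {effective_bits_for_snr_db(50):.2f} effective bits")  # ~8 bits
```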
I was actually imagining some kind of analogue to an 8 bit Analog-to-digital converter. Or maybe an op amp? My analog circuits knowledge is very rusty.
But anyway, if you draw up a model of some synapses as an analog circuit with actual analog components, that actually illustrates one of my main objections pretty nicely: neurons won’t actually meet the same performance specifications of that circuit, even if they behave like or are modeled by those circuits for specific input ranges and a narrow operating regime.
An actual analog circuit has to meet precise performance specifications within a specified operating domain, whether it is comprised of an 8 bit or 2 bit ADC, a high or low quality op amp, etc.
If you draw up a circuit made out of neurons, the performance characteristics and specifications it meets will probably be much more lax. If you relax specifications for a real analog circuit in the same way, you can probably make the circuit out of much cheaper and lower-energy component pieces.
An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.
The brain doesn’t have that constraint, so it can to some extent learn to exploit the nuances of each specific subcircuit or device. This is almost obviously superior in terms of low level circuit noise tolerance and space and energy efficiency, and is seen by some as the ultimate endpoint of Moore’s Law—see Hinton’s Forward Forward section 8, which I quoted here.
Regardless, if you see a neurotransmitter synapse system that is using 10k carriers or whatever flooding through variadic memory-like protein gates, such that deep simulations of the system indicate it is doing something similar-ish to analog multiplication with SNR equivalent to 7-bits or whatnot, and you have a bunch of other neuroscience and DL experiments justifying that interpretation, then that is probably what it is doing.
It is completely irrelevant whether it’s a ‘proper’ analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.
An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.
Mass production is one reason, but another reason this distinction actually is important is that they are performance characteristics of the whole system, not its constituent pieces. For both analog and digital circuits, these performance characteristics have very precise meanings.
Let’s consider flop/s for digital circuits.
If I can do 1M flop/s, that means roughly, every second, you can give me 2 million floats, and I can multiply them together pairwise and give you 1 million results, 1 second later. I can do this over and over again every second, and the values can be arbitrarily distributed over the entire domain of floating point numbers at a particular precision.[1]
“Synaptic computations” in the brain, as you describe them, do not have any of these properties. The fact that 10^15 of them happen per second is not equivalent or comparable to 10^15 flop/s, because it is not a performance characteristic of the system as a whole.
By analogy: suppose you have some gas particles in a container, and you’d like to simulate their positions. Maybe the simulation requires 10^15 flop/s to simulate in real time, and there is provably no more efficient way to run your simulation.
Does that mean the particles themselves are doing 10^15 flop/s? No!
Saying the brain does “10^15 synaptic operations per second” is a bit like saying the particles in the gas are doing “10^15 particle operations per second”.
The fact that, in the case of the brain, the operations themselves are performing some kind of useful work that looks like a multiply-add, and that this is maybe within an OOM of some fundamental efficiency limit, doesn’t mean you can coerce the types arbitrarily to say that the brain itself is efficient as a whole.
As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is “near the limit of thermodynamic efficiency” in the sense that every microjoule of energy it uses is required to make it actually work. Cooling + overclocking might let you push things a bit, but you’ll never be able to match the performance of re-designing the underlying transistors entirely at a smaller process (which often involves a LOT more than just shrinking individual transistors).
It is completely irrelevant whether it’s a ‘proper’ analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.
Indeed, Brains and digital circuits have completely different computational capabilities and performance characteristics. That’s kind of the whole point.
If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.
(If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.)
The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use. An FPGA is somewhere in between.
The brain uses active rewiring (and synapse growth/shrinkage) to physically adapt the hardware, which has the flexibility of an FPGA for the purposes of deep learning, but the efficiency of an ASIC.
As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is “near the limit of thermodynamic efficiency”
Or you could make the same argument about a pile of rocks, or a GPU as I noticed earlier. The entire idea of computation is a map territory enforcement, it always requires a mapping between a logical computation and physics.
If you simply assume—as you do—that the brain isn’t computing anything useful (as equivalent to deep learning operations, as I believe is overwhelmingly supported by the evidence), then you can always claim that, but I see no reason to pay attention whatsoever. I suspect you simply haven’t spent the requisite many thousands of hours reading the right DL and neuroscience.
The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use
This has a kernel of truth but it is misleading. There are plenty of algorithms that don’t naturally map to circuits, because a step of an algorithm in a circuit costs space, whereas a step of an algorithm in a programmable computer costs only those bits required to encode the task. The inefficiency of dynamic decode can be paid for with large enough algorithms. This is most obvious when considering large tasks on very small machines.
It is true that neither GPUs nor CPUs seem particularly pareto optimal for their broad set of tasks, versus a cleverer clean-sheet design, and it is also true that for any given task you could likely specialize a CPU or GPU design for it somewhat easily for at least marginal benefit, but I also think this is not the default way your comment would be interpreted.
If you simply assume—as you do—that the brain isn’t computing anything useful
I do not assume this, but I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 “synaptic computations”.
Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question, which is perfectly fine as a matter of topic choice, but you then can’t make any claims which depend on such an understanding.
Absent such an understanding, you can still make apples-to-apples comparisons about overall performance characteristics between digital and biological systems. But those _must_ be grounded in an actual precise performance metric of the system as a whole, if they are to be meaningful at all.
Component-wise analysis is not equivalent to system-wide analysis, even if your component-wise analysis is precise and backed by a bunch of neuroscience results and intuitions from artificial deep learning.
FYI for Jacob and others, I am probably not going to further engage directly with Jacob, as we seem to be mostly talking past each other, and I find his tone (“this is just nonsense”, “completely irrelevant”, “suspect you simply haven’t spent...”, etc.) and style of argument to be tiresome.
I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 “synaptic computations”.
Obviously it requires some of those computations, but in my ontology the question of how many is clearly a software efficiency question. The fact that an A100 can do ~1e15 low precision op/s (with many caveats/limitations) is a fact about the hardware that tells you nothing about how efficiently any specific A100 may be utilizing that potential. I claim that the brain can likewise do very roughly 1e15 synaptic ops/s, but that questions of utilization of that potential towards intelligence are likewise circuit/software efficiency questions (which I do address in some of my writing, but it is specifically out of scope for this particular question of synaptic hardware.)
Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question,
My original post does engage with this some in the circuit efficiency section. I draw the circuit/software distinction around architectural prior and learning algorithms (genetic/innate) vs acquired knowledge/skills (cultural).
I find his tone (“this is just nonsense”,
I used that in response to you saying “but if neurons are less repurposable and rearrangeable than transistors,”, which I do believe is actually nonsense, because neural circuits literally dynamically rewire themselves, which allows the flexibility of FPGAs (for circuit learning) combined with the efficiency of ASICs, and transistors are fixed circuits not dynamically modifiable at all.
If I was to try and steelman your position, it is simply that we can not be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.
To answer that question, I have provided some of the relevant arguments in my past writing, but at this point the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, and the fact that Moore’s Law shrinkage is petering out while the brain remains above the efficiency of our best accelerators, entirely shift the burden onto you to write up detailed analysis/arguments as to how you can explain these facts.
To answer that question, I have provided some of the relevant arguments in my past writing, but at this point the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, and the fact that Moore’s Law shrinkage is petering out while the brain remains above the efficiency of our best accelerators, entirely shift the burden onto you to write up detailed analysis/arguments as to how you can explain these facts.
I think there’s just not that much to explain, here—to me, human-level cognition just doesn’t seem that complicated or impressive in an absolute sense—it is performed by a 10W computer designed by a blind idiot god, after all.
The fact that current DL paradigm methods inspired by its functionality have so far failed to produce artificial cognition of truly comparable quality and efficiency seems more like a failure of those methods rather than a success, at least so far. I don’t expect this trend to continue in the near term (which I think we agree on), and grant you some bayes points for predicting it further in advance.
If I was to try and steelman your position, it is simply that we can not be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.
I was actually referring to the flexibility and re-arrangability at design time here. Verilog and Cadence can make more flexible use of logic gates and transistors than the brain can make of neurons during a lifetime, and the design space available to circuit designers using these tools is much wider than the one available to evolution.
A sanity check of a counterintuitive claim can be that the argument for the claim implies things that seem unjustifiable or false. It cannot be that the conclusion of the claim itself is unjustifiable or false, except inasmuch as you are willing to deny the possibility of being convinced of that claim by argument at all.
(To avoid confusion, this is not in response to the latter portion of your comment about general cognition.)
In terms of what the actual fundamental thermodynamic limits are, Jacob and I still disagree by a factor of about 50. (Basically, Jacob thinks the dissipated energy needs to be amped up in order to erase a bit with high reliability. I think that while there are some schemes where this is necessary, there are others where it is not, and high-reliability erasure is possible with an energy per bit approaching kT log 2. I’m still working through the math to check that I’m actually correct about this, though.)
If you read Landauer’s paper carefully, he analyzes 3 sources of noise, and kT log 2 is something like the speed of light for bit energy, only achieved at a useless 50% error rate and/or glacial speeds.
That’s only for the double well model, though, and only for erasing by lifting up one of the wells. I didn’t see a similar theorem proven for a general system. So the crucial question is whether it’s still true in general. I’ll get back to you eventually on that, I’m still working through the math. It may well turn out that you’re right.
I believe the double well model—although it sounds somewhat specific at a glance—is actually a fully universal conceptual category over all relevant computational options for representing a bit.
You can represent a bit with dominoes, in which case the two bistable states are up/down, you can represent it with few electron quantum dots in one of two orbital configs, or larger scale wire charge changes, or perhaps fluid pressure waves, or ..
The exact form doesn’t matter, as a bit always requires a binary classification between two partitions of device microstates, which leads to success probability being some exponential function of switching energy over noise energy. It’s equivalent to a binary classification task for Maxwell’s demon.
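A minimal sketch of that exponential-reliability picture, assuming the simple model P_err ≈ exp(−E_bit/kT) (this is exactly the modeling choice under dispute here, not a settled limit):

```python
import math

# Naive model: probability of mis-erasing a bit falls off as P_err ~ exp(-E_bit / kT),
# so the required switching energy is E_bit ~ kT * ln(1 / P_err). At P_err = 0.5 this
# gives the kT log 2 "speed of light" figure; high reliability needs tens of kT.
def required_energy_in_kT(p_err: float) -> float:
    return math.log(1.0 / p_err)

for p_err in (0.5, 1e-3, 1e-15, 1e-25):
    print(f"P_err = {p_err:g} -> E_bit ~ {required_energy_in_kT(p_err):.1f} kT")
# 0.5 -> ~0.7 kT, 1e-15 -> ~35 kT, 1e-25 -> ~58 kT: the gap between "order kT" and
# "order tens of kT" is roughly the factor-of-~50 disagreement described above.
```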
Summary of the conclusions is that energy on the order of kT should work fine for erasing a bit with high reliability, and the ~50kT claimed by Jacob is not a fully universal limit.
I’m confused at how somebody ends up calculating that a brain—where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually—could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed “Brain Efficiency” though not checked any numbers, and not seen anything inside it which seems to address this sanity check.
The first step in reducing confusion is to look at what a synaptic spike does. It is the equivalent of—in terms of computational power—an ANN ‘synaptic spike’, which is a memory read of a weight, a low precision MAC (multiply accumulate), and a weight memory write (various neurotransmitter plasticity mechanisms). Some synapses probably do more than this—nonlinear decoding of spike times for example, but that’s a start. This is all implemented in a pretty minimal size looking device. The memory read/write is local, but it also needs to act as an amplifier to some extent, to reduce noise and push the signal farther down the wire. An analog multiplier uses many charge carriers to get a reasonable SNR ratio, which compares to all the charge carries across a digital multiplier including interconnect.
So with that background you can apply the landauer analysis to get base bit energy, then estimate the analog MAC energy cost (or equivalent digital MAC, but the digital MAC is much larger so there are size/energy/speed tradeoffs), and finally consider the probably dominate interconnect cost. I estimate the interconnect cost alone at perhaps a watt.
A complementary approach is to compare to projected upcoming end-of CMOS scaling tech as used in research accelerator designs and see that you end up getting similar numbers (also discussed in the article).
The brain, like current CMOS tech, is completely irreversible. Reversible computation is possible in theory but is exotic like quantum computation requiring near zero temp and may not be practical at scale on a noisy environment like the earth, for the reasons outlined by Cavin/Zhirnov here and discussed in a theoretical cellular model by Tiata here—basically fully reversible computers rapidly forget everything as noise accumulates. Irreversible computers like brains and GPUs erase all thermal noise at every step, and pay the hot iron price to do so.
This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).
See my reply here which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight, momentum gradient etc in minimal viable precision, multiplier for forward back and weight update, etc), then the answer is a more straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but tis easy to demolish any notion that is merely doing something so simple.[1]
There are of course many types of synapses which perform somewhat different computations and thus have different configurations, sizes, energy costs, etc. I am mostly referring to the energy/compute dominate cortical pyramidal synapses.
Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.
Is your point that the amount of neurotransmitter is precisely meaningful (so that spending some energy/heat on pumping one additional ion is doing on the order of a bit of “meaningful work”)?
I’m not sure what you mean precisely by “precisely meaningful”, but I do believe we actually know enough about how neural circuits and synapses work[1] such that we have some confidence that they must be doing something similar to their artificial analogs in DL systems.
So this minimally requires:
storage for a K-bit connection weight in memory
(some synapses) nonlinear decoding of B-bit incoming neural spike signal (timing based)
analog ‘multiplication’[2] of incoming B-bit neural signal by K-bit weight
weight update from local backpropagating hebbian/gradient signal or equivalent
We know from DL that K and B do not need to be very large, but the optimal are well above 1-bit, and more importantly the long term weight storage (equivalent of gradient EMA/momentum) drives most of the precision demand, as it needs to accumulate many noisy measurements over time. From DL it looks like you want around 8-bit at least for long-term weight param storage, even if you can sample down to 4-bit or a bit lower for forward/backwards passes.
So that just takes a certain amount of work, and if you map out the minimal digital circuits in a maximally efficient hypothetical single-electron tile technology you really do get something on order 1e5 minimal 1eV units or more[3]. Synapses are also efficient in the sense that they grow/shrink to physically represent larger/smaller logical weights using more/less resources in the optimal fashion.
I have also argued on the other side of this—there are some DL researchers who think the brain does many many OOM more computation than it would seem, but we can rule that out with the same analysis.
To those with the relevant background knowledge in DL, accelerator designs, and the relevant neuroscience.
The actual synaptic operations are non-linear and more complex, but do something like the equivalent work of analog multiplication, and can’t be doing dramatically more or less.
This is not easy to do either and requires knowledge of the limits of electronics.
Thanks! (I’m having a hard time following your argument as a whole, and I’m also not trying very hard / being lazy / not checking the numbers; but I appreciate your answers, and they’re at least fleshing out some kind of model that feels useful to me. )
From the synapses section:
I don’t understand why this follows, and suspect it is false. Most of these synaptic operations are probably not “correct” multiply-adds of any precision—they’re actually more random, noisier functions that are approximated or modeled by analog MACs with particular input ranges.
And even if each synaptic operation really is doing the equivalent of an arbitrary analog MAC computation, that doesn’t mean that these operations are working together to do any kind of larger computation or cognition in anywhere close to the most efficient possible way.
Similar to how you can prune and distill large artificial models without changing their behavior much, I expect you could get rid of many neurons in the brain without changing the actual computation that it performs much or at all.
It seems like you’re modeling the brain as performing some particular exact computation where every bit of noise is counted as useful work. The fact that the brain may be within a couple OOM of fundamental thermodynamic limits of computing exactly what it happens to compute seems not very meaningful as a measure of the fundamental limit of useful computation or cognition possible given particular size and energy specifications.
I made a post which may help explain the analogy between spikes and multiply-accumulate operations.
I could make the exact same argument about some grad student’s first DL experiment running on a GPU, on multiple levels.
I also suspect you could get rid of many neurons in their DL model without changing the computation, I suspect they aren’t working together to do any kind of larger cognition in anywhere closer to the most efficient possible way.
It’s also likely they may not even know how to use the tensorcores efficiently, and even if they did the tensorcores waste most of their compute multiplying by zeros or near zeroes, regardless of how skilled/knowledge-able the DL practitioner.
And yet knowing all this, we still count flops in the obvious way, as counting “hypothetical fully utilized fllops” is not an easy useful quantity to measure discuss and compare.
Utilization of the compute resources is a higher level software/architecture efficiency consideration, not a hardware efficiency measure.
Given a CPU capable of a specified number of FLOPs at a specified precision, I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].
Not so for brains, for at least a couple of reasons:
An individual neuron can’t necessarily perform an arbitrary multiply / add / accumulate operation, at any particular precision. It may be modeled by an analog MAC of a specified precision over some input range.
The software / architecture point above. For many artificial computations we care about, we can apply both micro (e.g. assembly code optimization) and macro (e.g. using a non-quadratic algorithm for matrix multiplication) optimization to get pretty close to the theoretical limit of efficiency. Maybe the brain is already doing the analog version of these kinds optimizations in some cases. Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors, it’s another reason why the FLOPs to SYNOPs comparison is not appples-to-apples.
modulo some concerns about I/O, generation, checking, and CPU manufacturers inflating their benchmark numbers
And? DL systems just use those floats to simulate large NNs, and a good chunk of recent progress has resulted from moving down to lower precision from 32b to 16b to 8b and soon 4b or lower, chasing after the brain’s carefully tuned use of highly energy efficient low precision ops.
Intelligence requires exploring a circuit space, simulating circuits. The brain is exactly the kind of hardware you need to do that with extreme efficiency given various practical physical constraints.
GPUs/accelerators can match the brain in raw low precision op/s useful for simulating NNs (circuits), but use far more energy to do so and more importantly are also extremely limited by memory bandwidth which results in an extremely poor 100:1 or even 1000:1 alu:mem ratio, which prevents them from accelerating anything other than matrix matrix multiplication, rather than the far more useful sparse vector matrix multiplication.
This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.
Give me some floats, I can make a GPU do matrix multiplication, or sparse matrix multiplication, or many other kind of computations at a variety of precisions across the entire domain of floats at that precision.
A brain is (maybe) carrying out a computation which is modeled by a particular bunch of sparse matrix multiplications, in which the programmer has much less control over the inputs, domain, and structure of the computation.
The fact that some process (maybe) irreducibly requires some number of FLOPs to simulate faithfully is different from that process being isomorphic to that computation itself.
Intelligence requires exploring and simulating a large circuit space—ie by using something like gradient descent on neural networks. You can use a GPU to do that inefficiently or you can create custom nanotech analog hardware like the brain.
The brain emulates circuits, and current AI systems on GPUs simulate circuits inspired by the brain’s emulation.
I don’t think neuroplasticity is equivalent to architecting and then doing gradient descent on an artificial neural network. That process is more analogous to billions of years of evolution, which encoded most of the “circuit exploration” process in DNA. In the brain, some of the weights and even connections are adjusted at “runtime”, but the rules for making those connections are necessarily encoded in DNA.
(Also, I flatly don’t buy that any of this is required for intelligence.)
Further item of “these elaborate calculations seem to arrive at conclusions that can’t possibly be true”—besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of digital transistor operations about them, even if you otherwise use actual real-world physical hardware. Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency! https://spectrum.ieee.org/analog-ai
I’m not sure why you believe “the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible”. GPUs require at least on order ~1e-11J to fetch a single 8-bit value from GDDRX RAM (1e-19 J/b/nm (interconnect wire energy) * 1cm * 8), so around ~1KW or 100x the brain for 1e14 of those per second, not even including flop energy cost (the brain doesn’t have much more efficient wires, it just minimizes that entire cost by moving the memory synapses/weights as close as possible to the compute .. by merging them). I do claim that Moore’s Law is ending and not delivering much farther increase in CMOS energy efficiency (and essentially zero increase in wire energy efficiency), but GPUs are far from the optimal use of CMOS towards running NNs.
That sounds about right, and Indeed I roughly estimate the minimal energy for 8 bit analog MAC at the end of the synapse section, with 4 refs examples from the research lit:
The more complicated part of comparing these is how/whether to include the cost of reading/writing a synapse/weight value from RAM across a long wire, which is required for full equivalence to the brain. The brain as a true RNN is doing Vector Matrix multiplication, whereas GPUs/Accelerators instead do Matrix Matrix multiplication to amortize the cost of expensive RAM fetches. VM mult can simulate MM mult at no extra cost, but MM mult can only simulate VM mult at huge inefficiency proportional to the minimal matrix size (determined by ALU/RAM ratio, ~1000:1 now at low precision). The full neuromorphic or PIM approach instead moves the RAM next to the processing elements, and is naturally more suited to VM mult.
Bavandpour, Mohammad, et al. “Mixed-Signal Neuromorphic Processors: Quo Vadis?” 2019 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2019.
Chen, Jia, et al. “Multiply accumulate operations in memristor crossbar arrays for analog computing.” Journal of Semiconductors 42.1 (2021): 013104.
Li, Huihan, et al. “Memristive crossbar arrays for storage and computing applications.” Advanced Intelligent Systems 3.9 (2021): 2100017.
Li, Can, et al. “Analogue signal and image processing with large memristor crossbars.” Nature Electronics 1.1 (2018): 52-59.
Mahmoodi, M. Reza, and Dmitri Strukov. “Breaking POps/J barrier with analog multiplier circuits based on nonvolatile memories.” Proceedings of the International Symposium on Low Power Electronics and Design. 2018.
Okay, if you’re not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.
Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication? What does this have to do with the RAM being next to the processing units?
(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world expert in GPU design! E.g. PIM, CMOS, MAC, etc.)
Matrix Matrix Mult of square matrices of dim N uses ~2N^3 ALU ops and ~3N^2 MEM ops, so it has an arithmetic intensity of ~N (ALU:MEM ratio).
Vector Matrix Mult of dim N uses ~2N^2 ALU and ~3N^2 MEM, for an arithmetic intensity of ~1.
A GPU has an ALU:MEM ratio of about 1000:1 (for lower precision tensorcore ALU), so it is inefficient at vector matrix mult by a factor of about 1000 vs matrix matrix mult. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units in a tensorcore vs very long wire distances to reach a value in off chip RAM.
What is ALU and MEM exactly? And what is the significance of the ALU:MEM ratio?
The GPU needs numbers to be stored in registers inside the GPU before it can do operations on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you do an elementary arithmetic operation such as addition or multiplication on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor so are called ALU ops.
Because a matrix multiplication of two N×N matrices only involves 2N^2 distinct floating point numbers as input, and writing the result back into memory is going to cost you another N^2 memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N^2. In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with 2N⋅N^2 = 2N^3 ALU ops needed.
The ALU:MEM ratio is important because if your computation is imbalanced relative to what is supported by your hardware, then you’ll end up being bottlenecked by one of them and you’ll be unable to exploit the surplus resources you have on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio, then whenever you’re only using the hardware to do matrix multiplications you’ll have enormous amounts of MEM op capacity sitting idle, because the ALU side becomes the bottleneck and can’t consume data fast enough to keep the memory system busy.
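A small sketch may make this concrete (the 1000:1 ALU:MEM ratio is the rough figure Jacob quoted above, not the spec of any particular GPU):

```python
# Arithmetic intensity (ALU ops per MEM op) of matrix-matrix vs vector-matrix
# multiplication, and the resulting ALU utilization on hardware whose
# ALU:MEM throughput ratio is ~1000:1.
def intensity_mm(n):                      # N x N times N x N
    return (2 * n**3) / (3 * n**2)        # ~ N / 1.5

def intensity_vm(n):                      # length-N vector times N x N matrix
    return (2 * n**2) / (3 * n**2)        # ~ 2/3, independent of N

alu_mem_ratio = 1000                      # ALU ops the hardware can do per MEM op

for name, intensity in [("matrix-matrix, N=4096", intensity_mm(4096)),
                        ("vector-matrix, N=4096", intensity_vm(4096))]:
    # If the computation's intensity is below the hardware ratio, the ALUs
    # sit idle waiting on memory; utilization is capped by intensity / ratio.
    utilization = min(1.0, intensity / alu_mem_ratio)
    print(f"{name}: intensity ~{intensity:.1f}, ALU utilization ~{utilization:.1%}")
```

The vector-matrix case comes out around 0.1% ALU utilization, which is the roughly 1000x penalty Jacob mentions for simulating VM mult with hardware built for MM mult.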
This is helpful, thanks a ton Ege!
The section you were looking for is titled ‘Synapses’.
https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know#Synapses
And it says:
This just seems utterly wack. Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit? And “analog multiplication down to two decimal places” is the operation that is purportedly being carried out almost as efficiently as physically possible by… an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic terminal (molecules which will later need to be irreversibly pumped back out), which in turn depolarizes and starts flooding thousands of ions into a cell membrane (to be later pumped out) in order to transmit the impulse at 1m/s? That’s the most thermodynamically efficient a physical cognitive system can possibly be? This is approximately the most efficient possible way to turn all those bit erasures into thought?
This sounds like physical nonsense that fails a basic sanity check. What am I missing?
I am not certain it is being carried out “almost as efficiently as physically possible”, assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation). My belief is more that the brain and its synaptic elements are reasonably efficient in a Pareto tradeoff sense.
But any discussion around efficiency must make some starting assumptions about what computations the system may be performing. We now have a reasonable amount of direct and indirect evidence (direct evidence from neuroscience, indirect evidence from DL) that allows us some confidence that the brain is conventional (irreversible, non-quantum), and is basically very similar to an advanced low-power DL accelerator built out of nanotech replicators. (And the clear, obvious trend in hardware design is towards the brain.)
So starting with that frame ...
A synaptic op is the equivalent of reading an 8b-ish weight from memory, ‘multiplying’ by the incoming spike value, propagating the output down the wire, updating neurotransmitter receptors (which store not just the equivalent of the weight, but the Bayesian distribution params on the weight, equivalent to gradient momentum etc.), back-propagating spikes (in some scenarios), spike decoding (for nonlinear spike timing codes), etc.
It just does a fair amount of work, and if you query the research literature to see how many transistors that would take, it is something like 10k to 100k or more, each of which minimally uses ~1 eV per op (times ~10 for interconnect), according to the best micromodels of circuit limits (Cavin/Zhirnov).
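As a rough sketch of that estimate (the transistor counts, the ~1 eV per op, and the ~10x interconnect multiplier are the assumptions stated above; the 10 W / ~1e15 ops/s brain figures are the ones used elsewhere in this thread):

```python
# Rough energy for a digital circuit doing the equivalent of one synaptic op,
# using the assumptions above: 10k-100k transistor ops at ~1 eV each,
# times ~10 for interconnect (Cavin/Zhirnov-style micromodel figures).
eV = 1.602e-19                      # J
k_B, T = 1.38e-23, 300.0            # J/K, K
landauer_bit = k_B * T * 0.693      # kT ln 2 ~ 2.9e-21 J per bit erased

for transistors in (1e4, 1e5):
    energy_per_op = transistors * 1.0 * eV * 10     # J per synaptic-op equivalent
    print(f"{transistors:.0e} transistors -> {energy_per_op:.1e} J/op, "
          f"~{energy_per_op / landauer_bit:.0f}x kT ln 2")

# For comparison: a brain at ~10 W doing ~1e15 synaptic ops/s spends ~1e-14 J per op.
print(f"brain: ~{10 / 1e15:.0e} J per synaptic op")
```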
The analog multiplier and associated gear are very efficient (especially in space) for low SNRs, but they scale poorly (exponentially) with bit precision (equivalent SNR). From the last papers I recall, 8b is the crossover point where digital wins in energy and perhaps size; below that, analog dominates. There are numerous startups working on analog hardware to replace GPUs for low-bit-precision multipliers, chasing the brain, but it’s extremely difficult and IMHO may not be worth it without nanotech.
The brain only runs at 100hz and the axon conduction velocity is optimized just so that every brain region can connect to distal regions without significant delay (delay on order of a millisecond or so).
So the real question is then just why 100hz, which I also answer in Brain Efficiency. If you have a budget of 10W you can spend that running a very small NN very fast or a very large NN at lower speeds, and the latter seems more useful for biology. Digital minds obviously can spend the energy cost to run at fantastic speeds, and GPT-4 was only possible because its NN can run vaguely ~10,000x faster than the brain (for training).
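The size/speed tradeoff can be made explicit with a toy calculation (the ~1e-14 J per synaptic op is just the 10 W / ~1e15 ops/s figure implied above, used purely illustratively):

```python
# Fixed power budget: total synaptic ops/s = P / E. You can spend it on a
# large network updated slowly or a small network updated fast.
# Illustrative numbers only.
P = 10.0            # W, brain-like power budget
E = 1e-14           # J per synaptic op (~10 W / ~1e15 ops/s)
ops_per_second = P / E                      # ~1e15

for rate_hz in (100, 1e4, 1e6):             # whole-network update rate
    synapses_per_step = ops_per_second / rate_hz
    print(f"{rate_hz:>9.0f} Hz -> ~{synapses_per_step:.0e} synaptic ops per step")
```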
I’ll end with an interesting quote from Hinton[1]:
The Forward-Forward Algorithm, section 8
I think the quoted claim is actually straightforwardly true? Or at least, it’s not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.
I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.
These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (maybe, if the rest of it is correct) it is not possible to simulate the operation of a human brain at synapse granularity, using much less than 10W and 1000 cm^3. Which is an interesting fact if true, but doesn’t seem to have much bearing on the question of whether the brain is close to an optimal substrate for carrying out the abstract computation of human cognition.
(I expanded a little on the point about modeling a computation vs. the computation itself in an earlier sibling reply.)
I’m not sure what you mean by “precise 8 bit analog multiplication”, as analog is not precise in the way digital is. When I say 8-bit analog equivalent, I am talking about an analog operation that has SNR equivalent to quantized 8-bit digital, which is near the maximum useful range for analog multiplier devices, and near the upper range of estimates of synaptic precision.
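For reference, a sketch of the standard ideal-quantizer relation between bit depth and SNR, just to pin down roughly what SNR “8-bit equivalent” corresponds to:

```python
# Ideal-quantizer rule of thumb: SNR(dB) ~= 6.02*N + 1.76 for an N-bit
# quantizer with a full-scale sine input.
def snr_db(bits):
    return 6.02 * bits + 1.76

for bits in (2, 4, 7, 8):
    db = snr_db(bits)
    power_ratio = 10 ** (db / 10)
    print(f"{bits}-bit equivalent: ~{db:.1f} dB SNR (power ratio ~{power_ratio:.0e})")
# 8-bit equivalent works out to ~50 dB, a signal-to-noise power ratio of ~1e5.
```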
I was actually imagining some kind of analogue to an 8 bit Analog-to-digital converter. Or maybe an op amp? My analog circuits knowledge is very rusty.
But anyway, if you draw up a model of some synapses as an analog circuit with actual analog components, that actually illustrates one of my main objections pretty nicely: neurons won’t actually meet the same performance specifications as that circuit, even if they behave like or are modeled by those circuits for specific input ranges and a narrow operating regime.
An actual analog circuit has to meet precise performance specifications within a specified operating domain, whether it is comprised of an 8 bit or 2 bit ADC, a high or low quality op amp, etc.
If you draw up a circuit made out of neurons, the performance characteristics and specifications it meets will probably be much more lax. If you relax specifications for a real analog circuit in the same way, you can probably make the circuit out of much cheaper and lower-energy component pieces.
An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.
The brain doesn’t have that constraint, so it can to some extent learn to exploit the nuances of each specific subcircuit or device. This is almost obviously superior in terms of low-level circuit noise tolerance and space and energy efficiency, and is seen by some as the ultimate endpoint of Moore’s Law; see Hinton’s Forward-Forward section 8, which I quoted here.
Regardless, if you see a neurotransmitter synapse system that is using 10k carriers or whatever flooding through variadic memory-like protein gates, such that deep simulations of the system indicate it is doing something similar-ish to analog multiplication with SNR equivalent to 7 bits or whatnot, and you have a bunch of other neuroscience and DL experiments justifying that interpretation, then that is probably what it is doing.
It is completely irrelevant whether it’s a ‘proper’ analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.
Mass production is one reason, but another reason this distinction actually is important is that these are performance characteristics of the whole system, not of its constituent pieces. For both analog and digital circuits, these performance characteristics have very precise meanings.
Let’s consider flop/s for digital circuits.
If I can do 1M flop/s, that means roughly, every second, you can give me 2 million floats, and I can multiply them together pairwise and give you 1 million results, 1 second later. I can do this over and over again every second, and the values can be arbitrarily distributed over the entire domain of floating point numbers at a particular precision.[1]
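To illustrate that this is a whole-system, measurable spec, here is a minimal (and deliberately naive) way one might measure pairwise-multiply throughput; real benchmarks are more careful, and for this particular kernel the bottleneck is memory bandwidth rather than the ALUs:

```python
import time
import numpy as np

n = 10_000_000                        # 10 million pairs
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

start = time.perf_counter()
c = a * b                             # n pairwise multiplications
elapsed = time.perf_counter() - start
print(f"~{n / elapsed:.2e} pairwise multiplies per second")
# Note: this kernel is memory-bandwidth bound, not ALU bound -- its arithmetic
# intensity is far below the ALU:MEM ratio discussed elsewhere in the thread.
```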
“Synaptic computations” in the brain, as you describe them, do not have any of these properties. The fact that 10^15 of them happen per second is not equivalent or comparable to 10^15 flop/s, because it is not a performance characteristic of the system as a whole.
By analogy: suppose you have some gas particles in a container, and you’d like to simulate their positions. Maybe the simulation requires 10^15 flop/s to simulate in real time, and there is provably no more efficient way to run your simulation.
Does that mean the particles themselves are doing 10^15 flop/s? No!
Saying the brain does “10^15 synaptic operations per second” is a bit like saying the particles in the gas are doing “10^15 particle operations per second”.
The fact that, in the case of the brain, the operations themselves are performing some kind of useful work that looks like a multiply-add, and that this is maybe within an OOM of some fundamental efficiency limit, doesn’t mean you can coerce the types arbitrarily to say that the brain itself is efficient as a whole.
As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is “near the limit of thermodynamic efficiency” in the sense that every microjoule of energy it uses is required to make it actually work. Cooling + overclocking might let you push things a bit, but you’ll never be able to match the performance of re-designing the underlying transistors entirely at a smaller process (which often involves a LOT more than just shrinking individual transistors).
Indeed, brains and digital circuits have completely different computational capabilities and performance characteristics. That’s kind of the whole point.
If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.
The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use. An FPGA is somewhere in between.
The brain uses active rewiring (and synapse growth/shrinkage) to physically adapt the hardware, which has the flexibility of an FPGA for the purposes of deep learning, but the efficiency of an ASIC.
Or you could make the same argument about a pile of rocks, or a GPU, as I noticed earlier. The entire idea of computation is a map-territory enforcement; it always requires a mapping between a logical computation and physics.
If you simply assume, as you do, that the brain isn’t computing anything useful (useful meaning equivalent to deep learning operations, which I believe is overwhelmingly supported by the evidence), then you can always claim that, but I see no reason to pay attention whatsoever. I suspect you simply haven’t spent the requisite many thousands of hours reading the right DL and neuroscience.
This has a kernel of truth but it is misleading. There are plenty of algorithms that don’t naturally map to circuits, because a step of an algorithm in a circuit costs space, whereas a step of an algorithm in a programmable computer costs only those bits required to encode the task. The inefficiency of dynamic decode can be paid for with large enough algorithms. This is most obvious when considering large tasks on very small machines.
It is true that neither GPUs nor CPUs seem particularly Pareto optimal for their broad set of tasks, versus a cleverer clean-sheet design. It is also true that for any given task you could likely specialize a CPU or GPU design for it somewhat easily, for at least marginal benefit. But I also think this is not the default way your comment would be interpreted.
I do not assume this, but I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 “synaptic computations”.
Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question, which is perfectly fine as a matter of topic choice, but you then can’t make any claims which depend on such an understanding.
Absent such an understanding, you can still make apples-to-apples comparisons about overall performance characteristics between digital and biological systems. But those _must_ be grounded in an actual precise performance metric of the system as a whole, if they are to be meaningful at all.
Component-wise analysis is not equivalent to system-wide analysis, even if your component-wise analysis is precise and backed by a bunch of neuroscience results and intuitions from artificial deep learning.
FYI for Jacob and others, I am probably not going to further engage directly with Jacob, as we seem to be mostly talking past each other, and I find his tone (“this is just nonsense”, “completely irrelevant”, “suspect you simply haven’t spent...”, etc.) and style of argument to be tiresome.
Obviously it requires some of those computations, but in my ontology the question of how many is clearly a software efficiency question. The fact that an A100 can do ~1e15 low precision op/s (with many caveats/limitations) is a fact about the hardware that tells you nothing about how efficiently any specific A100 may be utilizing that potential. I claim that the brain can likewise do very roughly 1e15 synaptic ops/s, but that questions of utilization of that potential towards intelligence are likewise circuit/software efficiency questions (which I do address in some of my writing, but it is specifically out of scope for this particular question of synaptic hardware.)
My original post does engage with this some in the circuit efficiency section. I draw the circuit/software distinction around architectural prior and learning algorithms (genetic/innate) vs acquired knowledge/skills (cultural).
I used that in response to you saying “but if neurons are less repurposable and rearrangeable than transistors,” which I do believe is actually nonsense, because neural circuits literally dynamically rewire themselves, which allows the flexibility of FPGAs (for circuit learning) combined with the efficiency of ASICs, whereas transistors are fixed circuits, not dynamically modifiable at all.
If I were to try to steelman your position, it is simply that we cannot be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.
To answer that question, I have provided some of the relevant arguments in my past writing. But at this point, the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, and the fact that Moore’s Law shrinkage is petering out while the brain remains above the efficiency of our best accelerators, together entirely shift the burden onto you to write up detailed analysis/arguments as to how you can explain these facts.
I think there’s just not that much to explain, here—to me, human-level cognition just doesn’t seem that complicated or impressive in an absolute sense—it is performed by a 10W computer designed by a blind idiot god, after all.
The fact that current DL paradigm methods inspired by its functionality have so far failed to produce artificial cognition of truly comparable quality and efficiency seems more like a failure of those methods rather than a success, at least so far. I don’t expect this trend to continue in the near term (which I think we agree on), and grant you some bayes points for predicting it further in advance.
I was actually referring to the flexibility and re-arrangeability at design time here. Verilog and Cadence can make more flexible use of logic gates and transistors than the brain can make of neurons during a lifetime, and the design space available to circuit designers using these tools is much wider than the one available to evolution.
A sanity check of a counterintuitive claim can be that the argument for the claim implies things that seem unjustifiable or false. It cannot be that the conclusion of the claim itself is unjustifiable or false, except inasmuch as you are willing to deny the possibility of being convinced of that claim by argument at all.
(To avoid confusion, this is not in response to the latter portion of your comment about general cognition.)
If you read carefully, Brain Efficiency does actually have some disclaimers to the effect that it’s discussing the limits of irreversible computing using technology that exists or might be developed in the near future. See Jacob’s comment here for examples: https://www.lesswrong.com/posts/mW7pzgthMgFu9BiFX/the-brain-is-not-close-to-thermodynamic-limits-on?commentId=y3EgjwDHysA2W3YMW
In terms of what the actual fundamental thermodynamic limits are, Jacob and I still disagree by a factor of about 50. (Basically, Jacob thinks the dissipated energy needs to be amped up in order to erase a bit with high reliability. I think that while there are some schemes where this is necessary, there are others where it is not, and high-reliability erasure is possible with an energy per bit approaching kT log 2. I’m still working through the math to check that I’m actually correct about this, though.)
If you read Landauer’s paper carefully, he analyzes 3 sources of noise, and kT log 2 is something like the speed of light for bit energy, only achieved at a useless 50% error rate and/or glacial speeds.
That’s only for the double well model, though, and only for erasing by lifting up one of the wells. I didn’t see a similar theorem proven for a general system. So the crucial question is whether it’s still true in general. I’ll get back to you eventually on that, I’m still working through the math. It may well turn out that you’re right.
I believe the double well model—although it sounds somewhat specific at a glance—is actually a fully universal conceptual category over all relevant computational options for representing a bit.
You can represent a bit with dominoes, in which case the two bistable states are up/down, you can represent it with few electron quantum dots in one of two orbital configs, or larger scale wire charge changes, or perhaps fluid pressure waves, or ..
The exact form doesn’t matter, as a bit always requires a binary classification between two partitions of device microstates, which leads to success probability being some exponential function of switching energy over noise energy. It’s equivalent to a binary classification task for Maxwell’s demon.
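Under that simple exponential model, the reliability requirement translates into a multiple of kT roughly as follows (a sketch of the model as described above, not of Landauer’s full analysis):

```python
import math

# Simple model: per-switch error probability falls off exponentially in
# E_switch / kT, so a target error rate p requires roughly
#   E_switch ~ kT * ln(1 / p).
k_B, T = 1.38e-23, 300.0
kT = k_B * T

for p_error in (0.5, 1e-3, 1e-15, 1e-25):
    multiple = math.log(1.0 / p_error)          # required energy in units of kT
    print(f"p_error = {p_error:.0e}: ~{multiple:.1f} kT (~{multiple * kT:.1e} J) per bit")
# p = 0.5 recovers ~0.7 kT = kT ln 2; demanding very low error rates pushes the
# requirement into the tens of kT, the regime of the ~50 kT figure in this thread.
```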
Let me know how much time you need to check the math. I’d like to give you the option to make an entry for the prize.
Finished, the post is here: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit
Summary of the conclusions: energy on the order of kT should work fine for erasing a bit with high reliability, and the ~50 kT claimed by Jacob is not a fully universal limit.
Sorry for the slow response, I’d guess 75% chance that I’m done by May 8th. Up to you whether you want to leave the contest open for that long.
Okay, I’ve finished checking my math and it seems I was right. See post here for details: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit