So true 8-bit equivalent analog multiplication requires about 100k carriers/switches
This just seems utterly wack. Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit? And “analog multiplication down to two decimal places” is the operation that is purportedly being carried out almost as efficiently as physically possible by… an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic terminal (molecules which will later need to be irreversibly pumped back out), which in turn depolarizes and starts flooding thousands of ions into a cell membrane (to be later pumped out) in order to transmit the impulse at 1m/s? That’s the most thermodynamically efficient a physical cognitive system can possibly be? This is approximately the most efficient possible way to turn all those bit erasures into thought?
This sounds like physical nonsense that fails a basic sanity check. What am I missing?
And “analog multiplication down to two decimal places” is the operation that is purportedly being carried out almost as efficiently as physically possible by
I am not certain it is being carried out “almost as efficiently as physically possible”, assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation). My belief is more that the brain and its synaptic elements are reasonably efficient in a Pareto tradeoff sense.
But any discussion around efficiency must make some starting assumptions about what computations the system may be performing. We now have a reasonable amount of direct and indirect evidence—direct evidence from neuroscience, indirect evidence from DL—that allows us some confidence that the brain is conventional (irreversible, non-quantum), and is basically very similar to an advanced low-power DL accelerator built out of nanotech replicators. (And the clear, obvious trend in hardware design is towards the brain.)
So, starting with that frame...
Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit?
A synaptic op is the equivalent of reading an 8b-ish weight from memory, ‘multiplying’ it by the incoming spike value, propagating the output down the wire, updating neurotransmitter receptors (which store not just the equivalent of the weight, but the Bayesian distribution params on the weight, equivalent to gradient momentum, etc.), back-propagating the spike (in some scenarios), spike decoding (for nonlinear spike timing codes), and so on.
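As a rough illustration of that list, here is a minimal sketch of the equivalent DL-accelerator bookkeeping implied by one synaptic event. It is a toy model, not a biophysical one; the function name, the momentum-style update, and all constants are illustrative assumptions.

```python
# Toy sketch of the work attributed to one "synaptic op" above, written as the
# equivalent DL-accelerator bookkeeping. Not a biophysical model; all names,
# the momentum-style update, and the constants are illustrative assumptions.

def synaptic_op(weight_8b, spike_in, momentum, dendrite_accum,
                grad=0.0, lr=0.01, beta=0.9):
    w = weight_8b / 255.0                             # read the stored ~8-bit weight
    dendrite_accum += w * spike_in                    # 'multiply' by the spike value and accumulate
    momentum = beta * momentum + (1 - beta) * grad    # update distribution params stored on the weight
    w = min(1.0, max(0.0, w - lr * momentum))         # plasticity: adjust the stored weight
    return round(w * 255), momentum, dendrite_accum

new_w, m, acc = synaptic_op(weight_8b=128, spike_in=1.0, momentum=0.0, dendrite_accum=0.0)
```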
It just actually does a fair amount of work, and if you actually query the research literature to see how many transistors that would take, it is something like 10k to 100k or more, each of which minimally uses ~1 eV per op (times roughly 10 for interconnect), according to the best micromodels of circuit limits (Cavin/Zhirnov).
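To make the arithmetic concrete, here is the back-of-envelope those numbers imply, compared against the Landauer bound and the ~10 W / ~1e15 synaptic op/s figures quoted elsewhere in this thread. Treat it as a consistency check under those assumptions, not as independent evidence.

```python
eV = 1.602e-19              # joules
kT = 1.38e-23 * 300         # thermal energy at ~300 K, joules
landauer = kT * 0.693       # kT*ln(2): minimum energy to erase one bit (~3e-21 J)

for n_transistors in (1e4, 1e5):
    e_op = n_transistors * 1.0 * eV * 10    # ~1 eV per transistor-op, x10 for interconnect
    print(f"{n_transistors:.0e} transistors -> ~{e_op:.1e} J per digital synaptic-op equivalent")

print(f"brain budget (~10 W / ~1e15 op/s): ~{10 / 1e15:.1e} J per synaptic op")
print(f"Landauer limit per bit erasure:    ~{landauer:.1e} J")
```

On these assumptions, the brain's per-op budget lands near the low end of the digital-equivalent estimate, and corresponds to a few million kT·ln 2 per synaptic op, i.e. far more work than a single bit erasure.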
The analog multiplier and gear is very efficient (especially in space) for low SNRs, but it scales poorly (exponentially) with bit precision (equivalent SNR). From the last papers I recall, 8b is the crossover point where digital wins in energy and perhaps size. Below that, analog dominates. There are numerous startups working on analog hardware to replace GPUs for low bit precision multipliers, chasing the brain, but it's extremely difficult and IMHO may not be worth it without nanotech.
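A toy scaling model of that crossover, assuming thermal-noise-limited analog energy grows in proportion to the required SNR (roughly 4^b for b-bit-equivalent precision) while a digital array multiplier's energy grows roughly as b^2. The constants are made up; only the shapes and the rough location of the crossover are the point.

```python
def analog_energy(bits, e0=1.0):
    # thermal-noise-limited analog: energy grows with required SNR ~ 2^(2*bits)
    return e0 * 4 ** bits

def digital_energy(bits, e1=300.0):
    # array multiplier: roughly bits^2 full-adder operations, each a fixed cost
    return e1 * bits ** 2

for b in range(2, 13, 2):
    a, d = analog_energy(b), digital_energy(b)
    print(f"{b:2d} bits: analog {a:9.0f}   digital {d:7.0f}   -> {'analog' if a < d else 'digital'} wins")
```

With these (arbitrary) constants the crossover falls between 6 and 8 bits, matching the claim that analog dominates at low precision and digital wins around 8 bits.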
in order to transmit the impulse at 1m/s?
The brain only runs at ~100 Hz, and the axon conduction velocity is optimized just so that every brain region can connect to distal regions without significant delay (delay on the order of a millisecond or so).
So the real question is then just why 100 Hz—which I also answer in Brain Efficiency. If you have a budget of 10W you can spend it running a very small NN very fast or a very large NN at lower speeds, and the latter seems more useful for biology. Digital minds obviously can spend the energy cost to run at fantastic speeds—and GPT-4 was only possible because its NN can run vaguely ~10,000x faster than the brain (for training).
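The tradeoff being described is just a fixed ops-per-second budget split between network size and clock rate. A sketch, using the rough per-op energy implied by the earlier back-of-envelope (an assumption, not a measured figure):

```python
power_w = 10.0
joules_per_op = 1e-14                       # rough per-synaptic-op energy assumed above
ops_per_second = power_w / joules_per_op    # ~1e15 synaptic ops/s total budget

for rate_hz in (100, 1_000, 10_000, 1_000_000):
    active_synapses = ops_per_second / rate_hz
    print(f"at {rate_hz:>9} Hz, the 10 W budget covers ~{active_synapses:.0e} synaptic ops per cycle")
```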
I’ll end with an interesting quote from Hinton (The Forward-Forward Algorithm, section 8):
The separation of software from hardware is one of the foundations of Computer Science and it has many benefits. It makes it possible to study the properties of programs without worrying about electrical engineering. It makes it possible to write a program once and copy it to millions of computers. If, however, we are willing to abandon immortality it should be possible to achieve huge savings in the energy required to perform a computation and in the cost of fabricating the hardware that executes the computation. We can allow large and unknown variations in the connectivity and non-linearities of different instances of hardware that are intended to perform the same task and rely on a learning procedure to discover parameter values that make effective use of the unknown properties of each particular instance of the hardware. These parameter values are only useful for that specific hardware instance, so the computation they perform is mortal: it dies with the hardware.
I think the quoted claim is actually straightforwardly true? Or at least, it’s not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.
I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.
These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (maybe, if the rest of it is correct) it is not possible to simulate the operation of a human brain at synapse granularity, using much less than 10W and 1000 cm^3. Which is an interesting fact if true, but doesn’t seem to have much bearing on the question of whether the brain is close to an optimal substrate for carrying out the abstract computation of human cognition.
(I expanded a little on the point about modeling a computation vs. the computation itself in an earlier sibling reply.)
Or at least, it’s not really surprising that actual precise 8 bit analog multiplication
I’m not sure what you mean by “precise 8 bit analog multiplication”, as analog is not precise in the way digital is. When I say 8-bit analog equivalent, I am talking about an analog operation that has SNR equivalent to quantized 8-bit digital, which is near the maximum useful range for analog multiplier devices, and near the upper range of estimates of synaptic precision.
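For reference, the standard conversion between SNR and equivalent bits for an ideal quantizer, which is the sense in which “8-bit analog equivalent” is being used here; a small sketch of the formula:

```python
def snr_db_for_bits(bits):
    # ideal quantizer: SNR(dB) = 6.02 * bits + 1.76
    return 6.02 * bits + 1.76

def effective_bits_for_snr_db(snr_db):
    return (snr_db - 1.76) / 6.02

print(f"8-bit equivalent -> ~{snr_db_for_bits(8):.1f} dB SNR")
print(f"~44 dB SNR       -> ~{effective_bits_for_snr_db(44):.1f} effective bits")
```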
I was actually imagining some kind of analogue to an 8-bit analog-to-digital converter. Or maybe an op amp? My analog circuits knowledge is very rusty.
But anyway, if you draw up a model of some synapses as an analog circuit with actual analog components, that actually illustrates one of my main objections pretty nicely: neurons won’t actually meet the same performance specifications of that circuit, even if they behave like or are modeled by those circuits for specific input ranges and a narrow operating regime.
An actual analog circuit has to meet precise performance specifications within a specified operating domain, whether it is composed of an 8-bit or 2-bit ADC, a high- or low-quality op amp, etc.
If you draw up a circuit made out of neurons, the performance characteristics and specifications it meets will probably be much more lax. If you relax specifications for a real analog circuit in the same way, you can probably make the circuit out of much cheaper and lower-energy component pieces.
An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.
The brain doesn’t have that constraint, so it can to some extent learn to exploit the nuances of each specific subcircuit or device. This is almost obviously superior in terms of low-level circuit noise tolerance and space and energy efficiency, and is seen by some as the ultimate endpoint of Moore’s Law—see section 8 of Hinton’s Forward-Forward paper, which I quoted here.
Regardless, if you see a neurotransmitter synapse system that is using 10k carriers or whatever flooding through variadic memory-like protein gates, such that deep simulations of the system indicate it is doing something similar-ish to analog multiplication with SNR equivalent to 7 bits or whatnot, and you have a bunch of other neuroscience and DL experiments justifying that interpretation, then that is probably what it is doing.
It is completely irrelevant whether it’s a ‘proper’ analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.
An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.
Mass production is one reason, but another reason this distinction actually is important is that these are performance characteristics of the whole system, not of its constituent pieces. For both analog and digital circuits, these performance characteristics have very precise meanings.
Let’s consider flop/s for digital circuits.
If I can do 1M flop/s, that means roughly, every second, you can give me 2 million floats, and I can multiply them together pairwise and give you 1 million results, 1 second later. I can do this over and over again every second, and the values can be arbitrarily distributed over the entire domain of floating point numbers at a particular precision.[1]
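As a concrete illustration, this is the kind of whole-system measurement such a throughput claim licenses (a NumPy sketch; sustained numbers on real hardware depend on memory bandwidth, caching, and so on):

```python
import time
import numpy as np

n, reps = 1_000_000, 100
a = np.random.rand(n).astype(np.float32)   # arbitrary float inputs (here uniform in [0, 1))
b = np.random.rand(n).astype(np.float32)

t0 = time.perf_counter()
for _ in range(reps):
    c = a * b                              # n pairwise multiplies per pass
elapsed = time.perf_counter() - t0

print(f"sustained ~{n * reps / elapsed:.2e} pairwise multiplies per second")
```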
“Synaptic computations” in the brain, as you describe them, do not have any of these properties. The fact that 10^15 of them happen per second is not equivalent or comparable to 10^15 flop/s, because it is not a performance characteristic of the system as a whole.
By analogy: suppose you have some gas particles in a container, and you’d like to simulate their positions. Maybe the simulation requires 10^15 flop/s to simulate in real time, and there is provably no more efficient way to run your simulation.
Does that mean the particles themselves are doing 10^15 flop/s? No!
Saying the brain does “10^15 synaptic operations per second” is a bit like saying the particles in the gas are doing “10^15 particle operations per second”.
The fact that, in the case of the brain, the operations themselves are performing some kind of useful work that looks like a multiply-add, and that this is maybe within an OOM of some fundamental efficiency limit, doesn’t mean you can coerce the types arbitrarily to say that the brain itself is efficient as a whole.
As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is “near the limit of thermodynamic efficiency” in the sense that every microjoule of energy it uses is required to make it actually work. Cooling + overclocking might let you push things a bit, but you’ll never be able to match the performance of re-designing the underlying transistors entirely at a smaller process (which often involves a LOT more than just shrinking individual transistors).
It is completely irrelevant whether it’s a ‘proper’ analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.
Indeed, brains and digital circuits have completely different computational capabilities and performance characteristics. That’s kind of the whole point.
If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.
The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use. An FPGA is somewhere in between.
The brain uses active rewiring (and synapse growth/shrinkage) to physically adapt the hardware, which has the flexibility of an FPGA for the purposes of deep learning, but the efficiency of an ASIC.
As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is “near the limit of thermodynamic efficiency”
Or you could make the same argument about a pile of rocks, or a GPU, as I noted earlier. The entire idea of computation is a map-territory enforcement: it always requires a mapping between a logical computation and physics.
If you simply assume—as you do—that the brain isn’t computing anything useful (i.e., equivalent to deep learning operations, which I believe is overwhelmingly supported by the evidence), then you can always claim that, but I see no reason to pay attention whatsoever. I suspect you simply haven’t spent the requisite many thousands of hours reading the right DL and neuroscience.
The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use
This has a kernel of truth but it is misleading. There are plenty of algorithms that don’t naturally map to circuits, because a step of an algorithm in a circuit costs space, whereas a step of an algorithm in a programmable computer costs only those bits required to encode the task. The inefficiency of dynamic decode can be paid for with large enough algorithms. This is most obvious when considering large tasks on very small machines.
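A toy illustration of that space-versus-encoding point: a fully spatial (circuit) implementation pays one physical step per algorithm step, while a programmable machine pays a fixed-size program plus decode overhead. The numbers and the "adder" framing below are purely illustrative assumptions.

```python
def circuit_adder_count(n_inputs):
    # fully unrolled combinational reduction: one adder instance per algorithm step
    return n_inputs - 1

PROGRAM = ["acc = 0", "for x in inputs:", "    acc += x"]   # fixed-size encoding of the same task

for n in (8, 1024, 1_000_000):
    print(f"n={n:>9}: circuit needs {circuit_adder_count(n):>9} adders; program stays {len(PROGRAM)} lines")
```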
It is true that neither GPUs nor CPUs seem particularly Pareto optimal for their broad set of tasks versus a cleverer clean-sheet design, and it is also true that for any given task you could likely specialize a CPU or GPU design for it somewhat easily for at least marginal benefit, but I also think this is not the default way your comment would be interpreted.
If you simply assume—as you do—that the brain isn’t computing anything useful
I do not assume this, but I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 “synaptic computations”.
Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question, which is perfectly fine as a matter of topic choice, but you then can’t make any claims which depend on such an understanding.
Absent such an understanding, you can still make apples-to-apples comparisons about overall performance characteristics between digital and biological systems. But those _must_ be grounded in an actual precise performance metric of the system as a whole, if they are to be meaningful at all.
Component-wise analysis is not equivalent to system-wide analysis, even if your component-wise analysis is precise and backed by a bunch of neuroscience results and intuitions from artificial deep learning.
FYI for Jacob and others, I am probably not going to further engage directly with Jacob, as we seem to be mostly talking past each other, and I find his tone (“this is just nonsense”, “completely irrelevant”, “suspect you simply haven’t spent...”, etc.) and style of argument to be tiresome.
I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 “synaptic computations”.
Obviously it requires some of those computations, but in my ontology the question of how many is clearly a software efficiency question. The fact that an A100 can do ~1e15 low precision op/s (with many caveats/limitations) is a fact about the hardware that tells you nothing about how efficiently any specific A100 may be utilizing that potential. I claim that the brain can likewise do very roughly 1e15 synaptic ops/s, but that questions of utilization of that potential towards intelligence are likewise circuit/software efficiency questions (which I do address in some of my writing, but it is specifically out of scope for this particular question of synaptic hardware.)
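The hardware-capacity versus utilization distinction drawn here is the familiar peak-versus-achieved throughput gap; a trivial sketch with made-up achieved numbers, just to pin down the ontology:

```python
peak_ops = 1e15   # claimed low-precision op/s for an A100-class device (with many caveats)

workloads = {
    "well-tiled matmul": 4e14,     # hypothetical achieved op/s
    "memory-bound layer": 5e13,
    "host-side glue code": 1e10,
}

for name, achieved in workloads.items():
    print(f"{name:>20}: {achieved / peak_ops:7.3%} of peak")
```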
Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question,
My original post does engage with this some in the circuit efficiency section. I draw the circuit/software distinction around architectural prior and learning algorithms (genetic/innate) vs acquired knowledge/skills (cultural).
I find his tone (“this is just nonsense”,
I used that in response to you saying “but if neurons are less repurposable and rearrangeable than transistors,” which I do believe is actually nonsense, because neural circuits literally dynamically rewire themselves, which allows the flexibility of FPGAs (for circuit learning) combined with the efficiency of ASICs, whereas transistors are fixed circuits, not dynamically modifiable at all.
If I were to try and steelman your position, it is simply that we cannot be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.
To answer that question, I have provided some of the relevant arguments in my past writing. But at this point, given the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, and the fact that Moore's Law shrinkage is petering out while the brain remains above the efficiency of our best accelerators, the burden shifts entirely onto you to write up detailed analysis/arguments as to how you can explain these facts.
To answer that question, I have provided some of the relevant arguments in my past writing. But at this point, given the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, and the fact that Moore's Law shrinkage is petering out while the brain remains above the efficiency of our best accelerators, the burden shifts entirely onto you to write up detailed analysis/arguments as to how you can explain these facts.
I think there’s just not that much to explain, here—to me, human-level cognition just doesn’t seem that complicated or impressive in an absolute sense—it is performed by a 10W computer designed by a blind idiot god, after all.
The fact that current DL paradigm methods inspired by its functionality have so far failed to produce artificial cognition of truly comparable quality and efficiency seems more like a failure of those methods rather than a success, at least so far. I don’t expect this trend to continue in the near term (which I think we agree on), and grant you some bayes points for predicting it further in advance.
If I were to try and steelman your position, it is simply that we cannot be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.
I was actually referring to the flexibility and re-arrangeability at design time here. Verilog and Cadence can make more flexible use of logic gates and transistors than the brain can make of neurons during a lifetime, and the design space available to circuit designers using these tools is much wider than the one available to evolution.
A sanity check of a counterintuitive claim can be that the argument for the claim implies things that seem unjustifiable or false. It cannot be that the conclusion of the claim itself is unjustifiable or false, except insofar as you are willing to deny the possibility of being convinced of that claim by argument at all.
(To avoid confusion, this is not in response to the latter portion of your comment about general cognition.)
If you read carefully, Brain Efficiency does actually have some disclaimers to the effect that it’s discussing the limits of irreversible computing using technology that exists or might be developed in the near future. See Jacob’s comment here for examples: https://www.lesswrong.com/posts/mW7pzgthMgFu9BiFX/the-brain-is-not-close-to-thermodynamic-limits-on?commentId=y3EgjwDHysA2W3YMW
In terms of what the actual fundamental thermodynamic limits are, Jacob and I still disagree by a factor of about 50. (Basically, Jacob thinks the dissipated energy needs to be amped up in order to erase a bit with high reliability. I think that while there are some schemes where this is necessary, there are others where it is not and high-reliability erasure is possible with an energy per bit approaching kTlog2. I’m still working through the math to check that I’m actually correct about this, though.)
If you read Landauer's paper carefully, he analyzes 3 sources of noise, and kTlog2 is something like the speed of light for bit energy, only achieved at a useless 50% error rate and/or glacial speeds.
That’s only for the double well model, though, and only for erasing by lifting up one of the wells. I didn’t see a similar theorem proven for a general system. So the crucial question is whether it’s still true in general. I’ll get back to you eventually on that, I’m still working through the math. It may well turn out that you’re right.
I believe the double well model—although it sounds somewhat specific at a glance—is actually a fully universal conceptual category over all relevant computational options for representing a bit.
You can represent a bit with dominoes, in which case the two bistable states are up/down, you can represent it with few electron quantum dots in one of two orbital configs, or larger scale wire charge changes, or perhaps fluid pressure waves, or ..
The exact form doesn’t matter, as a bit always requires a binary classification between two partitions of device microstates, which leads to success probability being some exponential function of switching energy over noise energy. It’s equivalent to a binary classification task for Maxwell’s demon.
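To make “some exponential function of switching energy over noise energy” concrete: a toy calculation under the simple two-state (double well) picture described above, assuming the bit-flip error probability is roughly a Boltzmann factor exp(-E_b/kT) in the barrier energy. This is the model behind the ~50kT figure disputed later in the thread, not a settled limit.

```python
import math

def error_prob(barrier_in_kT):
    # double-well picture: probability of ending in the wrong well ~ exp(-E_b / kT)
    return math.exp(-barrier_in_kT)

for e in (math.log(2), 1, 5, 10, 25, 50):
    print(f"E_b = {e:5.2f} kT  ->  error probability ~ {error_prob(e):.1e}")
```

Under this model, kT·log 2 corresponds to a ~50% error rate (matching the reading of Landauer above), while driving errors down to ~1e-22 takes ~50 kT; the linked post argues this double-well model is not fully universal, which is exactly the point in dispute.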
Let me know how much time you need to check the math. I’d like to give the option to make an entry for the prize.
Finished, the post is here: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit
Summary of the conclusions is that energy on the order of kT should work fine for erasing a bit with high reliability, and the ~50kT claimed by Jacob is not a fully universal limit.
Sorry for the slow response, I’d guess 75% chance that I’m done by May 8th. Up to you whether you want to leave the contest open for that long.
Okay, I’ve finished checking my math and it seems I was right. See post here for details: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit