Comments on “The Singularity is Nowhere Near”

I followed a link on Twitter to a fun and informative 2015 blog post by Tim Dettmers:

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near

The headline conclusion is that it takes at least 10^21 FLOP/s to run the algorithms of a human brain, and therefore “it is unlikely that there will be a technological singularity in this century.” I disagree with that, and this post explores why.

(Specifically, I disagree with “at least 10^21 FLOP/s”. There’s a separate step to go from “at least 10^21 FLOP/s” to “it is unlikely that there will be a technological singularity in this century”; that step is related to Moore’s law, bandwidth requirements for parallelization, etc. Tim’s blog post has extensive discussion of this second step, and I won’t say anything about that here; I’d have to think about it more.)

(I’m writing this in 2021, six years later, but Tim has a comment on this very site that says he still stands by that post; in fact he now goes even further and says “I believe that AGI will be physically impossible with classical computers.”)

I highly recommend the original post. Indeed, if I didn’t like the post so much, I would not have bothered writing a response. :-)

Are brain algorithms computationally expensive to simulate?

Yes! Definitely! I think it’s especially telling that nobody has applied the Dileep George brain-inspired image-processing model to ImageNet, sticking to much smaller images with far fewer categories of objects (MNIST, CAPTCHAs etc.).

Likewise, this Randall O’Reilly paper has a fascinating computational exploration of (in my opinion) different and complementary aspects of the human visual system. That paper tests its theories on a set of ≈1000 256×256-pixel, 8-frame movies from 100 categories—compare to ImageNet’s 14 million images from 20,000 categories … or compare it to the number of visual categories that you can recognize. Training the model still took 512 InfiniBand-connected processor nodes running for ≈24 hours on their campus supercomputer (source: personal communication). The real human vision system is dramatically larger and more complicated than this model, and the whole brain is larger and more complicated still!

But, when I say “computationally expensive to simulate” above, I mean it in, like, normal-person-in-2021 standards of what’s computationally expensive to simulate. A very different question is whether the brain is “computationally expensive to simulate” by the standards of GPT-3, the standards of big tech data centers, the standards of “what will be feasible in 2030 or 2040 or 2050”, and things like that. There, I don’t have a strong opinion. I consider it an open question.

Note also that the two brain-inspired image-recognition examples just above are pushing innovative algorithms, and therefore are presumably handicapped by things like

  1. there are probably variations & tweaks on the algorithms that make them work better and faster, and they would not have been discovered yet,

  2. nobody has designed new ASICs to run these algorithms efficiently—analogous to how GPU/​TPUs are now designed around matrix multiplication and deep neural nets,

  3. probably not much effort has gone into making the existing algorithms run optimally on existing hardware.

So anyway, the fact that a couple of today’s “most brain-like algorithms” (as judged by me) seem to be computationally expensive to scale up is not much evidence one way or the other for whether brain-like AGI algorithms would be “computationally expensive” with industrial-scale investment in the long-term or even short-term. Again, I consider it an open question.

Tim’s blog post argues that it is not an open question: his estimate is 10^21 FLOP/s to run the algorithms of a human brain, which (he says) puts it out of reach for the century, and maybe (as in his recent comment) simply beyond what you can do with a classical computer. And he says that’s an underestimate!

This is quite a bit more skeptical than Joseph Carlsmith’s recent OpenPhil report “How Much Computational Power Does It Take to Match the Human Brain?”. That offers many estimation methods which come in at 10^13 to 10^17 FLOP/s, with 10^21 being an extreme upper end.

What accounts for the discrepancy?

Where does Tim’s estimate of 10^21 FLOP/s come from?

(Be warned that it’s very possible I’m misunderstanding something, and that I have zero experience simulating neurons. I’ve simulated lots of other things, and I’ve read about simulating neurons, but that’s different from actually making a neuron simulation with my own hands.)

Let’s just jump to the headline calculation:

8.6e10 × 200 × 10,000 × 5 × 5 × 50 × 5 ≈ 1.075e21 FLOP/s.

Let’s go through the terms one by one.

  • 8.6e10 is the 86 billion neurons in an adult brain. We’re going to calculate the FLOP/​s to simulate a “typical” neuron and then multiply it by 86 billion. Of course, different neurons have different complexity to simulate—you can read Tim’s blog post for examples, including more specific calculations related to two particular neuron types in the cerebellum, but anyway, I’m on board so far.

  • 200 is the number of times per second that you need to do a time-step of the simulation—in other words, the idea is that you want ~5ms time resolution. OK sure, that sounds like the right order of magnitude, as far as I know. I mean, I have the impression that better time-resolution than that is important for some processes, but on the other hand, you don’t necessarily need to do a fresh calculation from scratch each 5ms. Whatever, I dunno, let’s stick with the proposed 200Hz and move on.

  • 10,000 is the number of synapses to a typical neuron. AI Impacts says it should only be 2000-3000. Not sure what the deal is with that. Again it’s different for different neurons. Whatever, close enough.

  • The first “5” is an assumption that we need to do a separate floating-point operation involving the state of each synapse during each of the 5 most recent timesteps—in other words, to do the calculation for timestep N, the assumption here is that we need to do a separate operation involving the state of each synapse in timestep N, N-1, N-2, N-3, and N-4. I’ll get back to this.

  • The next 5×50 is the number of dendrite branches (50 dendrites, 5 branches per dendrite). If you don’t know, neurons get their inputs from dendrites, which (at least for some neurons) form a remarkably huge tree-like structure. (Of course, this being biology, you can always find crazy exceptions to every rule, like that one backwards neuron that sends outputs into its dendrites. But I digress.) Dendrite branches are important because of “dendritic spikes”, a spike that travels along a dendrite and its branches, also affects neighboring dendrites to some extent, and might or might not trigger a proper neuron spike that goes down the axon. See Tim’s post for more on this fascinating topic.

  • The last factor of 5 is, again, the assumption that there are slowly-dissipating effects such that to calculate what’s going on in timestep N, we need to do a separate operation involving the state of each branch in timestep N, N-1, N-2, N-3, and N-4.
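As a quick arithmetic check, here is a minimal Python sketch that multiplies the factors listed above. The dictionary keys are my own labels for illustration, not names from Tim's post.

```python
import math

# Factors from the headline calculation, in the order discussed above.
# The key names are illustrative labels, not terminology from Tim's post.
factors = {
    "neurons": 8.6e10,
    "timesteps_per_second": 200,
    "synapses_per_neuron": 10_000,
    "synapse_history_steps": 5,
    "branches_per_dendrite": 5,
    "dendrites_per_neuron": 50,
    "branch_history_steps": 5,
}

# The estimate is simply the product of all the factors.
total_flops = math.prod(factors.values())
print(f"{total_flops:.3e} FLOP/s")
```

The product comes out at roughly 1.075e21, which is where the headline "at least 10^21 FLOP/s" figure comes from.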

So all in all, the implicit story behind multiplying these numbers together is:

Take each neuron A in each timestep B. Then take each synapse C on that neuron, and take each dendritic branch D on that neuron. Take one of the five most recent timesteps E for the synapse, and another one of the five most recent timesteps F for the dendritic branch. Now do at least one floating-point operation involving these particular ingredients, and repeat for all possible combinations.

I say “no way”. That just can’t be right, can it?

Let’s start with the idea of multiplying the number of synapses by the number of branches. So take a random synapse (synapse #49) and independently take a random branch of a random dendrite (branch #12). Most of the time the synapse is not on that branch, and indeed not even on that dendrite! Why would we need to do a calculation specifically involving those two things?

If any influence can spread from a synapse way over here to a branch way over there, I think it would be the kind of thing that can be dealt with in a hierarchical calculation. Like, from the perspective of dendrite #6, you don’t need to know the fine-grained details of what’s happening in each individual synapse on dendrite #2; all you need to know is some aggregated measure of what’s going on at dendrite #2, e.g. whether it’s spiking, what mix of chemicals it’s dumping out into the soma, or whatever.

So I want to say that the calculation is not O(number of synapses × number of branches), but rather O(number of synapses) + O(number of branches). You do calculations for each synapse, then you do calculations for each branch (or each segment of each branch) that gradually aggregate the effects of those synapses over larger scales. Or something like that.
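To make the difference concrete, here is a toy Python sketch of the two-stage idea. Everything in it is an assumption for illustration: the function names, the per-branch summation, and the threshold standing in for a dendritic spike are hypothetical and are not a model of real dendrites.

```python
def pairwise_ops(n_synapses, n_branches):
    """Operation count if every synapse is paired with every branch."""
    return n_synapses * n_branches


def hierarchical_ops(n_synapses, n_branches):
    """Operation count if each synapse is visited once, then each branch once."""
    return n_synapses + n_branches


def neuron_drive(synapse_inputs, branch_of_synapse, n_branches, threshold=1.0):
    """Toy two-stage aggregation: sum synapses per branch, then combine branches."""
    branch_totals = [0.0] * n_branches
    for value, branch in zip(synapse_inputs, branch_of_synapse):
        branch_totals[branch] += value  # one operation per synapse
    # Hypothetical branch nonlinearity: a plain threshold, one operation per branch.
    branch_outputs = [1.0 if total > threshold else 0.0 for total in branch_totals]
    return sum(branch_outputs)


# With the numbers from the headline calculation (10,000 synapses, 5 x 50 branches):
print(pairwise_ops(10_000, 250))      # every synapse against every branch
print(hierarchical_ops(10_000, 250))  # synapses first, then branches
print(neuron_drive([0.6, 0.6, 0.2], [0, 0, 1], n_branches=2))
```

With these counts, the pairwise scheme needs 2,500,000 operations per neuron per timestep against 10,250 for the hierarchical one, a ratio of roughly 250.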

Next, the time model. I disagree with this too. Again, Tim is budgeting 5×5=25 operations per timestep to deal with time-history. The idea is that at timestep N, you’re doing a calculation involving “the state of synapse #18 in timestep (N-3) and of branch #59 in timestep (N-1)”, and a different calculation for (N-1) and (N-4), and yet another for (N-2) and (N), etc. etc. I don’t think that’s how it would work. Instead I imagine that you would track a bunch of state variables for the neuron, and update the state each timestep. Then your timestep calculation would input the previous state and what’s happening now, and would output the new state. So I think it should be a factor of order 1 to account for effects that are prolonged in time. Admittedly, you could say that the number “25” is arguably “a factor of order 1”, but whatever. :-P
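Here is a small Python sketch of what I mean by tracking state instead of revisiting history. It assumes, purely for illustration, that the lingering effect of past inputs decays exponentially; the function names and the decay model are hypothetical, not something taken from Tim's post or from any neuron simulator.

```python
def history_response(inputs, decay, n):
    """Recompute from scratch at timestep n: weight every past input by its age."""
    return sum(decay ** age * inputs[n - age] for age in range(n + 1))


def run_state_updates(inputs, decay):
    """Carry a single state variable forward: one multiply-add per timestep."""
    state = 0.0
    states = []
    for x in inputs:
        state = decay * state + x  # new state from previous state plus current input
        states.append(state)
    return states


inputs = [1.0, 0.0, 0.5, 0.0, 2.0]
states = run_state_updates(inputs, decay=0.5)

# Both approaches give the same answer at every timestep,
# but the state update never looks back at earlier inputs.
for n, state in enumerate(states):
    assert abs(state - history_response(inputs, 0.5, n)) < 1e-12
```

Under that assumption the per-timestep cost does not grow with how far back the effects reach, which is the sense in which I'd expect a factor of order 1 here.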

Oh, also, in a typical timestep, most synapses haven’t fired for the previous hundreds of milliseconds, so you get another order of magnitude or so reduction in computational cost from sparsity.

So put all that together, and now my back-of-the-envelope is like 50,000× lower than Tim’s.
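For what it's worth, here is one way the three reductions could be tallied in Python. The breakdown is an assumption on my part for illustration: I'm treating the time-history budget as dropping from 25 to about 1, and "an order of magnitude or so" of sparsity as exactly 10. It tallies reduction factors only and is not a positive estimate of anything.

```python
synapses = 10_000
branches = 5 * 50

# Structure: synapses x branches becomes synapses + branches.
structure_factor = (synapses * branches) / (synapses + branches)

# Time history: 5 x 5 = 25 operations assumed to collapse to roughly 1.
time_factor = 5 * 5

# Sparsity: "another order of magnitude or so", taken here as exactly 10.
sparsity_factor = 10

combined_reduction = structure_factor * time_factor * sparsity_factor
print(f"structure: ~{structure_factor:.0f}x, combined: ~{combined_reduction:,.0f}x")
```

That lands in the same ballpark as the 50,000× figure, which is all a back-of-the-envelope calculation like this can claim.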

(By the way, please don’t divide 10^21 FLOP/s by 50,000 and call it “Steve’s estimate of the computational cost of brain simulations”. This is a negative case against the 10^21 number, not a positive case for any model in particular. If you want my opinion, I don’t have one right now, as I said above. In the meantime I defer to the OpenPhil report.)

(Parts of this section are copying points made in the comment section of Tim’s blog.)

(Also, my favorite paper proposing an algorithmic purpose of dendritic spikes in cortical pyramidal neurons basically proposes that it functions as an awfully simple set of ANDs and ORs, more or less. I don’t read too much into that—I think the dendritic spikes are doing other computations too, which might or might not be more complicated. But I find that example suggestive.)

What about dynamic gene expression, axonal computations, subthreshold learning, etc.?

To be clear, Tim posited that the 10^21 FLOP/s was an underestimate, because there were lots of other complications neglected by this model. Here’s a quote from his post:

Here is small list of a few important discoveries made in the last two years [i.e. 2013-2015] which increase the computing power of the brain by many orders of magnitude:

  • It was shown that brain connections rather than being passive cables, can themselves process information and alter the behavior of neurons in meaningful ways, e.g. brain connections help you to see the objects in everyday life. This fact alone increases brain computational complexity by several orders of magnitude

  • Neurons which do not fire still learn: There is much more going on than electrical spikes in neurons and brain connections: Proteins, which are the little biological machines which make everything in your body work, combined with local electric potential do a lot of information processing on their own — no activation of the neuron required

  • Neurons change their genome dynamically to produce the right proteins to handle everyday information processing tasks. Brain: “Oh you are reading a blog. Wait a second, I just upregulate this reading-gene to help you understand the content of the blog better.” (This is an exaggeration — but it is not too far off)

My main response is a post I wrote earlier: Building brain-inspired AGI is infinitely easier than understanding the brain. To elaborate and summarize a bit:

  • Just because the brain does something in some bizarre, complicated, and inscrutable way, that doesn’t mean that it’s a very expensive calculation for a future AGI programmer. For example, biology makes oscillators by connecting up neurons into circuits, and these circuits are so complex and confusing that they have stumped generations of top neuroscientists. But if you want an oscillator in an AGI, no problem! It’s one line of C code: y = sin(ωt) !

  • If you model a transistor in great detail, it’s enormously complicated. Only a tiny piece of that complexity contributes to useful computations. By the same token, you can simulate the brain at arbitrary levels of detail, and (this being biology) you’ll find intricate, beautiful, publication-worthy complexity wherever you look. But that doesn’t mean that this complexity is playing an essential-for-AGI computational role in the system. (Or if it is, see previous bullet point.)

  • Humans can understand how rocket engines work. I just don’t see how some impossibly-complicated-Rube-Goldberg-machine of an algorithm can learn rocket engineering. There was no learning rocket engineering in the ancestral environment. There was nothing like learning rocket engineering in the ancestral environment!! Unless, of course, you take the phrase “like learning rocket engineering” to be so incredibly broad that even learning toolmaking, learning botany, learning animal-tracking, or whatever, are “like learning rocket engineering” in the algorithmically-relevant sense. And, yeah, that’s totally a good perspective to take! They do have things in common! “Patterns tend to recur.” “Things are often composed of other things.” “Patterns tend to be localized in time and space.” You get the idea. If your learning algorithm does not rely on any domain-specific assumptions beyond things like “patterns tend to recur” and “things are often composed of other things” or whatever, then just how impossibly complicated and intricate can the learning algorithm be, really? I just don’t see it.

    • (Of course, the learned model can be arbitrarily intricate and complicated. I’m talking about the learning algorithm here—that’s what is of primary interest for AGI timelines, I would argue.)

I don’t pretend that this is a rigorous argument; it’s intuitions knocking against each other. I’m open to discussion. :-)