Comments on “The Singularity is Nowhere Near”
I followed a link on Twitter to a fun and informative 2015 blog post by Tim Dettmers:
The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near
The headline conclusion is that it takes at least 10^21 FLOP/s to run the algorithms of a human brain, and therefore “it is unlikely that there will be a technological singularity in this century.” I disagree with that, and this post explores why.
(Specifically, I disagree with “at least 10^21 FLOP/s”. There’s a separate step to go from “at least 10^21 FLOP/s” to “it is unlikely that there will be a technological singularity in this century”—this step is related to Moore’s law, bandwidth requirements for parallelization, etc. Tim’s blog post has extensive discussion of this second step, and I won’t say anything about that here; I’d have to think about it more.)
(I’m writing this in 2021, six years later, but Tim has a comment on this very site that says he still stands by that post; in fact he now goes even further and says “I believe that AGI will be physically impossible with classical computers.”)
I highly recommend the original post. Indeed, if I didn’t like the post so much, I would not have bothered writing a response. :-)
Are brain algorithms computationally expensive to simulate?
Yes! Definitely! I think it’s especially telling that the Dileep George brain-inspired image-processing model has never been applied to ImageNet; its authors have stuck to much smaller images with far fewer categories of objects (MNIST, CAPTCHAs, etc.).
Likewise, this Randall O’Reilly paper has a fascinating computational exploration of (in my opinion) different and complementary aspects of the human visual system. That paper tests its theories on a set of ≈1000 256×256-pixel, 8-frame movies from 100 categories—compare to ImageNet’s 14 million images from 20,000 categories … or compare it to the number of visual categories that you can recognize. Training the model still took 512 InfiniBand-connected processor nodes running for ≈24 hours on their campus supercomputer (source: personal communication). The real human vision system is dramatically larger and more complicated than this model, and the whole brain is larger and more complicated still!
But, when I say “computationally expensive to simulate” above, I mean it in, like, normal-person-in-2021 standards of what’s computationally expensive to simulate. A very different question is whether the brain is “computationally expensive to simulate” by the standards of GPT-3, the standards of big tech data centers, the standards of “what will be feasible in 2030 or 2040 or 2050”, and things like that. There, I don’t have a strong opinion. I consider it an open question.
Note also that the two brain-inspired image-recognition examples just above are pushing innovative algorithms, and therefore are presumably handicapped by things like
there are probably variations & tweaks on the algorithms that would make them work better and faster, but they would not have been discovered yet,
nobody has designed new ASICs to run these algorithms efficiently—analogous to how GPU/TPUs are now designed around matrix multiplication and deep neural nets,
probably not much effort has gone into making the existing algorithms run optimally on existing hardware.
So anyway, the fact that a couple of today’s “most brain-like algorithms” (as judged by me) seem to be computationally expensive to scale up is not much evidence one way or the other for whether brain-like AGI algorithms would be “computationally expensive” with industrial-scale investment in the long-term or even short-term. Again, I consider it an open question.
Tim’s blog post argues that it is not an open question: his estimate is 10^21 FLOP/s to run the algorithms of a human brain, which (he says) puts it out of reach for the century, and maybe (as in his recent comment) simply beyond what you can do with a classical computer. And he says that’s an underestimate!
This is quite a bit more skeptical than Joseph Carlsmith’s recent OpenPhil report “How Much Computational Power Does It Take to Match the Human Brain?”. That report offers many estimation methods which come in at roughly 10^13–10^17 FLOP/s, with 10^21 being an extreme upper end.
What accounts for the discrepancy?
Where does Tim’s estimate of 10^21 FLOP/s come from?
(Be warned that it’s very possible I’m misunderstanding something, and that I have zero experience simulating neurons. I’ve simulated lots of other things, and I’ve read about simulating neurons, but that’s different from actually making a neuron simulation with my own hands.)
Let’s just jump to the headline calculation:
8.6e10 × 200 × 10,000 × 5 × (5 × 50) × 5 ≈ 10^21 FLOP/s.
Let’s go through the terms one by one.
8.6e10 is the 86 billion neurons in an adult brain. We’re going to calculate the FLOP/s to simulate a “typical” neuron and then multiply it by 86 billion. Of course, different neurons have different complexity to simulate—you can read Tim’s blog post for examples, including more specific calculations related to two particular neuron types in the cerebellum, but anyway, I’m on board so far.
200 is the number of times per second that you need to do a time-step of the simulation—in other words, the idea is that you want ~5ms time resolution. OK sure, that sounds like the right order of magnitude, as far as I know. I mean, I have the impression that better time-resolution than that is important for some processes, but on the other hand, you don’t necessarily need to do a fresh calculation from scratch each 5ms. Whatever, I dunno, let’s stick with the proposed 200Hz and move on.
10,000 is the number of synapses on a typical neuron. AI Impacts says it should be only 2,000–3,000. Not sure what the deal is with that. Again it’s different for different neurons. Whatever, close enough.
The first “5” is an assumption that we need to do a separate floating-point operation involving the state of each synapse during each of the 5 most recent timesteps—in other words, to do the calculation for timestep N, the assumption here is that we need to do a separate operation involving the state of each synapse in timestep N, N-1, N-2, N-3, and N-4. I’ll get back to this.
The next 5×50 is the number of dendrite branches (50 dendrites, 5 branches per dendrite). If you don’t know, neurons get their inputs from dendrites, which (at least for some neurons) form a remarkably huge tree-like structure. (Of course, this being biology, you can always find crazy exceptions to every rule, like that one backwards neuron that sends outputs into its dendrites. But I digress.) Dendrite branches are important because of “dendritic spikes”, a spike that travels along a dendrite and its branches, also affects neighboring dendrites to some extent, and might or might not trigger a proper neuron spike that goes down the axon. See Tim’s post for more on this fascinating topic.
The last factor of 5 is, again, the assumption that there are slowly-dissipating effects such that to calculate what’s going on in timestep N, we need to do a separate operation involving the state of each branch in timestep N, N-1, N-2, N-3, and N-4.
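As a quick sanity check, multiplying all of these terms together does reproduce the headline figure:

```python
# Reproduce the headline estimate by multiplying the terms discussed above.
neurons = 8.6e10         # neurons in an adult brain
timesteps_per_s = 200    # ~5 ms time resolution
synapses = 10_000        # synapses on a typical neuron
synapse_history = 5      # separate operation for each of the 5 most recent timesteps
branches = 5 * 50        # 50 dendrites x 5 branches per dendrite
branch_history = 5       # again, 5 most recent timesteps

flops = neurons * timesteps_per_s * synapses * synapse_history * branches * branch_history
print(f"{flops:.2e}")    # ~1.1e21 FLOP/s
```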
So all in all, the implicit story behind multiplying these numbers together is:
Take each neuron A in each timestep B. Then take each synapse C on that neuron, and take each dendritic branch D on that neuron. Take one of the five most recent timesteps E for the synapse, and another one of the five most recent timesteps F for the dendritic branch. Now do at least one floating-point operation involving these particular ingredients, and repeat for all possible combinations.
I say “no way”. That just can’t be right, can it?
Let’s start with the idea of multiplying the number of synapses by the number of branches. So take a random synapse (synapse #49) and independently take a random branch of a random dendrite (branch #12). Most of the time the synapse is not that branch, and indeed not even on that dendrite! Why would we need to do a calculation specifically involving those two things?
If any influence can spread from a synapse way over here to a branch way over there, I think it would be the kind of thing that can be dealt with in a hierarchical calculation. Like, from the perspective of dendrite #6, you don’t need to know the fine-grained details of what’s happening in each individual synapse on dendrite #2; all you need to know is some aggregated measure of what’s going on at dendrite #2, e.g. whether it’s spiking, what mix of chemicals it’s dumping out into the soma, or whatever.
So I want to say that the calculation is not O(number of synapses × number of branches), but rather O(number of synapses) + O(number of branches). You do calculations for each synapse, then you do calculations for each branch (or each segment of each branch) that gradually aggregate the effects of those synapses over larger scales. Or something like that.
Next, the time model. I disagree with this too. Again, Tim is budgeting 5×5=25 operations per timestep to deal with time-history. The idea is that at timestep N, you’re doing a calculation involving “the state of synapse #18 in timestep (N-3) and of branch #59 in timestep (N-1)”, and a different calculation for (N-1) and (N-4), and yet another for (N-2) and (N), etc. I don’t think that’s how it would work. Instead, I imagine that you would track a bunch of state variables for the neuron, and update the state each timestep. Then your timestep calculation would input the previous state and what’s happening now, and would output the new state. So I think it should be a factor of order 1 to account for effects that are prolonged in time. Admittedly, you could say that the number “25” is arguably “a factor of order 1”, but whatever. :-P
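To illustrate the state-variable alternative, here is a toy sketch of my own (the leaky-integrator form and the decay constant are illustrative assumptions, not anything from Tim’s post). The point is that a single decaying state variable summarizes the entire input history, so each timestep costs O(1) work with no explicit five-step lookback:

```python
import math

DT = 0.005    # 5 ms timestep
TAU = 0.015   # assumed time constant for slowly-dissipating effects (made up)
DECAY = math.exp(-DT / TAU)

def step(state: float, input_now: float) -> float:
    """Leaky-integrator update: the decaying `state` carries the whole
    input history forward, replacing explicit multi-timestep lookbacks."""
    return state * DECAY + input_now

# The state after a few steps still reflects older inputs without storing them:
s = 0.0
for x in [1.0, 0.0, 0.0]:
    s = step(s, x)
print(s)  # a decayed trace of the input from 3 steps ago
```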
Oh, also, in a typical timestep, most synapses haven’t fired for the previous hundreds of milliseconds, so you get another order of magnitude or so reduction in computational cost from sparsity.
So put all that together, and now my back-of-the-envelope is like 50,000× lower than Tim’s.
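For concreteness, here is the rough arithmetic behind that factor, as I tally the reductions argued above:

```python
# Factor 1: replace (synapses x branches) with (synapses + branches).
product_terms = 10_000 * (5 * 50)   # Tim's implicit 2.5 million combinations
sum_terms = 10_000 + (5 * 50)       # hierarchical aggregation: ~10,000 ops
factor_hierarchy = product_terms / sum_terms   # ~250x

# Factor 2: replace the 5x5 time-history combinations with an O(1) state update.
factor_time = 5 * 5

# Factor 3: ~1 order of magnitude from synaptic sparsity.
factor_sparsity = 10

total = factor_hierarchy * factor_time * factor_sparsity
print(round(total))  # ~61,000, i.e. the same ballpark as "like 50,000x"
```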
(By the way, please don’t divide 10^21 FLOP/s by 50,000 and call it “Steve’s estimate of the computational cost of brain simulations”. This is a negative case against the 10^21 number, not a positive case for any model in particular. If you want my opinion, I don’t have one right now, as I said above. In the meantime I defer to the OpenPhil report.)
(Parts of this section are copying points made in the comment section of Tim’s blog.)
(Also, my favorite paper proposing an algorithmic purpose of dendritic spikes in cortical pyramidal neurons basically proposes that it functions as an awfully simple set of ANDs and ORs, more or less. I don’t read too much into that—I think the dendritic spikes are doing other computations too, which might or might not be more complicated. But I find that example suggestive.)
What about dynamic gene expression, axonal computations, subthreshold learning, etc.?
To be clear, Tim posited that the 10^21 FLOP/s was an underestimate, because there were lots of other complications neglected by this model. Here’s a quote from his post:
Here is small list of a few important discoveries made in the last two years [i.e. 2013-2015] which increase the computing power of the brain by many orders of magnitude:
It was shown that brain connections rather than being passive cables, can themselves process information and alter the behavior of neurons in meaningful ways, e.g. brain connections help you to see the objects in everyday life. This fact alone increases brain computational complexity by several orders of magnitude
Neurons which do not fire still learn: There is much more going on than electrical spikes in neurons and brain connections: Proteins, which are the little biological machines which make everything in your body work, combined with local electric potential do a lot of information processing on their own — no activation of the neuron required
Neurons change their genome dynamically to produce the right proteins to handle everyday information processing tasks. Brain: “Oh you are reading a blog. Wait a second, I just upregulate this reading-gene to help you understand the content of the blog better.” (This is an exaggeration — but it is not too far off)
My main response is a post I wrote earlier: Building brain-inspired AGI is infinitely easier than understanding the brain. To elaborate and summarize a bit:
Just because the brain does something in some bizarre, complicated, and inscrutable way, that doesn’t mean that it’s a very expensive calculation for a future AGI programmer. For example, biology makes oscillators by connecting up neurons into circuits, and these circuits are so complex and confusing that they have stumped generations of top neuroscientists. But if you want an oscillator in an AGI, no problem! It’s one line of C code: y = sin(ωt) !
If you model a transistor in great detail, it’s enormously complicated. Only a tiny piece of that complexity contributes to useful computations. By the same token, you can simulate the brain at arbitrary levels of detail, and (this being biology) you’ll find intricate, beautiful, publication-worthy complexity wherever you look. But that doesn’t mean that this complexity is playing an essential-for-AGI computational role in the system. (Or if it is, see previous bullet point.)
Humans can understand how rocket engines work. I just don’t see how some impossibly-complicated-Rube-Goldberg-machine of an algorithm can learn rocket engineering. There was no learning rocket engineering in the ancestral environment. There was nothing like learning rocket engineering in the ancestral environment!! Unless, of course, you take the phrase “like learning rocket engineering” to be so incredibly broad that even learning toolmaking, learning botany, learning animal-tracking, or whatever, are “like learning rocket engineering” in the algorithmically-relevant sense. And, yeah, that’s totally a good perspective to take! They do have things in common! “Patterns tend to recur.” “Things are often composed of other things.” “Patterns tend to be localized in time and space.” You get the idea. If your learning algorithm does not rely on any domain-specific assumptions beyond things like “patterns tend to recur” and “things are often composed of other things” or whatever, then just how impossibly complicated and intricate can the learning algorithm be, really? I just don’t see it.
(Of course, the learned model can be arbitrarily intricate and complicated. I’m talking about the learning algorithm here—that’s what is of primary interest for AGI timelines, I would argue.)
I don’t pretend that this is a rigorous argument, it’s intuitions knocking against each other. I’m open to discussion. :-)
I haven’t read the linked post/comment yet, and perhaps I am missing something very obvious, but: we have exaflop computing (that’s 10^18) right now. Is Tim Dettmers really saying that we’re not going to see a 1000x speed-up, in a century or possibly ever? That seems like a shocking claim, and I struggle to imagine what could justify it.
EDIT: I have now read the linked comment; it speaks of fundamental physical limitations such as speed of light, heat dissipation, etc., and says:
I do not find this convincing. Taking the outside view, we can see all sorts of similar predictions of limitations having been made over the course of computing history, and yet Moore’s Law is still going strong despite quite a few years of predictions of imminent trend-crashing. (Take a look at the “Recent trends” and “Alternative materials research” sections of the Wikipedia page; do you really see any indication that we’re about to hit a hard barrier? I don’t…)
Also, these physical limits – insofar as they are hard limits – are limits on various aspects of the impressiveness of the technology, but not on the cost of producing the technology. Learning-by-doing, economies of scale, process-engineering R&D, and spillover effects should still allow for costs to come down, even if the technology itself can hardly be improved.
It is fun to note that Metaculus is extremely uncertain about how many FLOPS will be required for AGI. The community lower 25% bound is 3.9×10^15 FLOPS and the upper 75% bound is 4.1×10^20 FLOPS, with very flattish tails extending well beyond these bounds. (The median is 6.2×10^17.)
I mention this mainly to point out that his estimate of 10^21 FLOPS reflects overconfidence in his particular model. There are simple objections that should reduce confidence in that kind of extremely high estimate at least somewhat.
For example, the human brain runs on 20 watts of glucose-derived power, and is optimized to fit through a birth canal. These design constraints alone suggest that much of its architectural weirdness arises due to energy and size restrictions, not due to optimization on intelligence. Actually optimizing for intelligence with no power or size restrictions will yield intelligent structures that look very different, so different that it is almost pointless to use brains as a reference object.
Again, I think a healthy stance to take here isn’t “Tim Dettmers is WRONG” but rather “Tim Dettmers is overconfident.”
Tim Dettmers’ whole approach seems to assume that there are no computational shortcuts: no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, while programmers can use fast-Fourier-transform-based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyze the dimensions of the system, find that some are strongly attracting, and just work in that subspace.
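To make the convolution example concrete, here is a quick sketch (standard NumPy, nothing brain-specific): direct convolution of length-n signals costs O(n²) multiply-adds, while the FFT route costs O(n log n) and gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)   # input signal
k = rng.standard_normal(512)   # convolution kernel

# Brute force, O(n^2) -- the "evolution" way:
direct = np.convolve(x, k)

# FFT-based, O(n log n) -- the "programmer" way:
n = len(x) + len(k) - 1
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

print(np.allclose(direct, fft_based))  # same result, far fewer operations
```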
Of course, all this provides an upper bound on the amount of compute needed to make a human-level AI. Tim Dettmers is trying to prove it can’t be done, and that needs a lower bound. To get a lower bound, don’t look at how long it takes a computer to simulate a human; look at how long it takes a human to simulate a computer. This bound is really rather useless compared to modern levels of compute, but it might give us some rough idea of how bad overhead can be. Suppose we thought “compute needed to be at least as smart as a human” was uniformly distributed somewhere between “compute needed to simulate a human” and “compute a human can simulate”.
Well actually, it depends on what intelligence test we give. Human brains have been optimised towards (human stuff) so it probably takes more compute to socialize to a human level than it takes to solve integrals to a human level.
Interesting but probably irrelevant note.
There are subtleties in even the very loose lower bound of a human simulating a CPU. Suppose there was some currently unknown magic algorithm. This algorithm can hypothetically solve all sorts of really tricky problems in a handful of CPU cycles. It is so fast that a human mentally simulating a CPU running this algorithm would still beat current humans on a lot of important problems. (Not problems humans can solve very quickly, because no algorithm can do much in <1 clock cycle.) If such a magic algorithm exists, then it’s possible that even an AI running on a 1-operation-per-day computer could be arguably superhuman. Of course, I am somewhat doubtful that an algorithm that magic exists (I have no strong evidence of non-existence, just the weak evidence that evolution didn’t find it and we haven’t found it yet). Either way, we are far into the realm of instant takeoff on any computer.
If you swapped out “AGI” for “Whole Brain Emulation” then Tim Dettmers’ analysis becomes a lot more reasonable.
Tim is simply neglecting the obvious brute-force route to brain-like capabilities. This is yet another startup, and I’m not saying this approach will commercially succeed, but: [singularity hub]
The linked article is about a startup called Cerebras, which has gotten a “wafer scale engine” to at least run in demos: an entire silicon wafer made into one large chip.
Enough of these, connected by hollow core optical fiber, would be what you need to hit that 10^21 threshold.
Also note that AI systems get a bunch of advantages that humans don’t have. Each system is immortal and is always doing its best. Human beings make mistakes on simple tasks at high error rates; we do not “do our best” consistently 24/7/365. What does it mean to achieve human-like performance? Did you mean average performance, or the performance of the best, well-rested human alive?
Do you want broad-spectrum capabilities or just the objects in ImageNet? Because, again, it’s harder than it sounds for a human to do better.
AI systems in applications like autonomous cars get to learn from the experiences of their peers in a way that is not biased. Think about how biased the information you get from your peers is: for one thing, humans tend to only tell each other about successes, which can cause you to overestimate your chance of success in a risky venture like a startup. A peer autonomous vehicle, by contrast, can report the (novel situation, true outcome) pair in an unbiased way to a cloud farm that pushes the updated learning out to the fleet. Each individual car doesn’t need to do the learning itself.
In fact, here’s another flaw in Tim’s reasoning. He’s assuming we must have an AI system that learns in real time like a human does. This is not true; humans don’t learn in real time either, which is why we need 16-20 years of education to be useful.
Each AI system used in a field can give answers to questions in realtime while recording high-prediction-error results for later training. This is roughly how OpenAI’s current algorithms already do it, though I am neglecting details.
For a useful AI system deployed in a field, therefore, you need only a tiny fraction of all the neurons a human uses; most are never going to contribute to any single task you might do as a human. And if a rare edge case shows up that needs more capability than a pared-down, “sparse” system used in a real application, you would have the field AI system pause its robotics and query a larger version of itself for the answer.
The more I type, the more I realize how bullshit everything in this argument was. And there are efforts to make a silicon chip with more of the tradeoffs of the human brain. If you think you need power efficiency and breadth of capabilities more than accuracy, you can just do this: [an article on a startup that has built analog computers for neural-network convolution]
So for Tim to be correct, he needs to take into account a “best effort” example (a large array of analog silicon processors filling a whole warehouse) and conclude that even that cannot hit the required computational throughput.
That startup is at about 300 TOPS for a single chip, so for a quick napkin estimate call it 10^14 ops/s. It’s a startup making some of the first analog computers used in decades, so let’s assume there’s at least a power of 10 of “easy gains” left over if this became a mature commercial technology: 10^15.
10^21 / 10^15 = 10^6, or 1 million chips in a warehouse. Go to a “chiplet” architecture to cram them into fewer packages, put 10 per package, and you have 100,000 packages.
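Spelling out that napkin math, using the comment’s own rounded numbers (the 10× “easy gains” multiplier and the 10-chips-per-package figure are this comment’s assumptions, not measured values):

```python
ops_per_chip = 1e14 * 10        # ~300 TOPS rounded to 1e14, times assumed 10x gains
target = 1e21                   # Tim's brain estimate in FLOP/s

chips = target / ops_per_chip   # chips needed in the warehouse
packages = chips / 10           # with an assumed 10 chips per "chiplet" package

print(int(chips), int(packages))  # 1000000 100000
```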
Current number 1 supercomputer is Fugaku with 158,976 48-core CPUs.
Cheap and easy if you had to do this next week? No, but it sounds like, with enough resources available, you could solve the problem even if we never get another improvement in silicon.