Brain efficiency matters a great deal for AGI timelines and takeoff speeds, as AGI is implicitly/explicitly defined in terms of brain parity. If the brain is about 6 OOM away from the practical physical limits of energy efficiency, then roughly speaking we should expect about 6 OOM of further Moore’s Law hardware improvement past the point of brain parity: perhaps two decades of progress at current rates, which could be compressed into a much shorter time period by an intelligence explosion—a hard takeoff.
But if the brain is already near said practical physical limits, then merely achieving brain parity in AGI at all will already require using up most of the optimizational slack, leaving not much left for a hard takeoff—thus a slower takeoff.
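To make the rough arithmetic explicit, here is a minimal back-of-the-envelope sketch; the doubling times are illustrative assumptions chosen only to bracket plausible rates, not figures from any particular roadmap.

```python
# Rough sketch: how many years of steady hardware improvement it takes to close
# a given efficiency gap. Doubling times are illustrative assumptions.
import math

def years_to_close(oom_gap, doubling_time_years):
    """Years needed to cover `oom_gap` orders of magnitude if efficiency
    doubles every `doubling_time_years` years."""
    return oom_gap * math.log2(10) * doubling_time_years  # 1 OOM ~ 3.32 doublings

for gap in (6, 1):                    # far-from-limits vs near-limits scenarios
    for dt in (1.0, 2.0):             # assumed doubling times, in years
        print(f"{gap} OOM gap, {dt}-yr doubling: ~{years_to_close(gap, dt):.0f} years")
# A 6 OOM gap lasts roughly two to four decades at these rates; a ~1 OOM gap
# leaves only a few years of conventional hardware slack.
```

The "two decades" figure corresponds to the faster assumption of roughly one efficiency doubling per year.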
Nice post!
I guess your model is something like
Step 1: hardware with efficiency similar to the brain,
Step 2: recursive self-improvement but only if Moore’s law hasn’t topped out yet by that point.
And therefore (on this model) it’s important to know if “efficiency similar to the brain” is close to the limits.
If that’s the model, I have some doubts. My model would be more like:
Step 1: algorithms with capability similar to the brain in some respects (which could have efficiency dramatically lower than the brain, because people are perfectly happy to run algorithms on huge banks of GPUs sucking down 100s of kW of electricity, etc.).
Step 2: fast improvement of capability (maybe) via any of:
2A: Better code (vintage Recursive Self-Improvement, or “we got the secret sauce, now we pick the low-hanging fruit of making it work better”)
2B: More or better training (or “more time to learn and think” in the brain-like context [a.k.a. online-learning])
2C: More hardware resources (more parameters, more chips, more instances, either because the programmers decide to after having promising results, or because the AGI is hacking into cloud servers or whatever).
Each of these might or might not happen, depending on whether there is scope for improvement that hasn’t already been squeezed out before step 1, which in turn depends on lots of things.
I didn’t even mention “2D: Better chips”, because it seems much slower than A, B, or C. Normally I think of “hard takeoff” as being defined as “days, weeks, or at most months”, in which case fabricating new better chips seems unlikely to contribute.
I also agree with FeepingCreature’s comment that “the brain and today’s deep neural nets are comparably efficient at thus-and-such task” is pretty weak evidence that there isn’t some “neither-of-the-above” algorithm waiting to be discovered which is much more efficient than either. There might or might not be, I think it’s hard to say.
Well, at the end I said: “If we only knew the remaining secrets of the brain today, we could train a brain-sized model consisting of a small population of about 1000 agents/sims, running on about as many GPUs”.
So my model absolutely is that we are limited by algorithmic knowledge. If we had that knowledge today we would be training AGI right now, because as this article indicates, 1000 GPUs are already roughly powerful enough to simulate 1000 instances of a single shared brain-size ANN. Sure, it may use a MW of power, or 1 kW per agent-instance: about 100x less efficient than the brain, but only 10x less efficient than the whole attached human body. And regardless, that doesn’t matter much, as human workers are 4 or 5 OOM more expensive than their equivalent raw energy cost.
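For concreteness, here are the rough numbers behind that comparison; the wattages, electricity price, and worker cost below are my own round-number assumptions, not figures from the article.

```python
# Order-of-magnitude check on the efficiency and cost claims above.
# All constants are assumed round numbers.
BRAIN_W = 10              # assumed brain power draw (W)
BODY_W = 100              # assumed whole-body power draw (W)
AGENT_W = 1000            # 1 MW spread over ~1000 GPU-hosted agent instances
USD_PER_KWH = 0.10        # assumed electricity price
WORKER_USD_PER_YEAR = 100_000   # assumed loaded cost of a skilled human worker
HOURS_PER_YEAR = 24 * 365

print(AGENT_W / BRAIN_W)  # ~100x the brain's power
print(AGENT_W / BODY_W)   # ~10x the whole body's power

agent_electricity = AGENT_W / 1000 * HOURS_PER_YEAR * USD_PER_KWH   # ~$876/year
brain_energy_cost = BRAIN_W / 1000 * HOURS_PER_YEAR * USD_PER_KWH   # ~$9/year
print(round(agent_electricity), round(WORKER_USD_PER_YEAR / brain_energy_cost))
# ~$876/year of electricity per agent instance, while a human worker costs
# roughly 10^4-10^5 times the raw energy cost of the brain doing the same job.
```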
I don’t think it’s weak evidence at all, because of all the evidence we have of biological evolution achieving near-optimality in so many other key efficiency metrics—at some point you just have to concede and update that biological evolution finds highly efficient or near-optimal solutions. The DL comparisons then show that technological evolution—a different search process—is converging on similar algorithms and limits. I find this rather compelling—and I’m genuinely curious as to why you don’t. (Perhaps one could argue that DL is too influenced by the brain? But we really did try many other approaches.) I found FeepingCreature’s comment to be confused—as if he didn’t read the article (but perhaps I should make some sections clearer?).
About your intuition that evolution made brains optimal… well but then there are people like John von Neumann who clearly demonstrate that the human brain can be orders of magnitude more productive without significantly higher energy costs.
My model of the human brain isn’t that it’s the most powerful biological information processing organ possible—far from it. In my view of the world we are merely the first species that passed an intelligence threshold allowing it to produce a technological civilisation. As soon as a species passed that threshold, civilisation popped into existence.
We are the dumbest species possible that still manages to coordinate and accumulate technology. This doesn’t tell you much about what the limits of biology are.
Optimal is a word one should use with caution and always with respect to some measure, and I use it selectively, usually as ‘near-optimal’ or some such. The article does not argue that brains are ‘optimal’ in some generic sense. I referenced JVN just as an example of a mentat—that human brains are capable of learning reasonably efficient numeric circuits, even though that’s well outside of evolutionary objectives. JVN certainly isn’t the only example of a human mentat like that, and he certainly isn’t evidence “that the human brain can be orders of magnitude more productive”.
Sure, I agree your stated “humans first to cross the finish line” model (or really EY’s) doesn’t tell you much about the limits of biology. To understand the actual limits of biology, you have to identify what those actual physical limits are, and then evaluate how close brains are to said limits. That is in fact what this article does.
We passed the threshold for language. We passed the threshold from evolutionarily specific intelligence to universal-Turing-machine-style intelligence through linguistic mental programs/programming. Before that, everything a big brain learns during a lifetime is lost; after that, immortal, substrate-independent mental programs could evolve separately from the disposable brain soma: cultural/memetic evolution. This is a one-time major phase shift in evolution, not some specific brain adaptation (even though some of the latter obviously enables the former).
For example, if there were an image-processing algorithm that used many fewer operations overall, but where those operations were more serial and less parallel—e.g. it required 1000 sequential steps for each image—then I think evolution would not have found it, because brains are too slow.
So then you need a different reason to think that such an algorithm doesn’t exist.
Maybe you can say “If such an algorithm existed, AI researchers would have found it by now.” But would they really? If AI researchers hadn’t been stealing ideas from the brain, would they have even invented neural nets by now? I dunno.
Or you can say “Something about the nature of image processing is that doing 1000 sequential steps just isn’t that useful for the task.” I guess I find that claim kinda plausible, but I’m just not very confident; I don’t feel like I have such a deep grasp of the fundamental nature of image processing that I can make claims like that.
In other domains besides image processing, I’d be even less confident. For example, I can kinda imagine some slightly-alien form of “reasoning” or “planning” that was mostly like human “reasoning” or “planning” but sometimes involved fast serial operations. After all, I find it very handy to have a fast serial laptop. If access to fast serial processing is useful for “me”, maybe it would be also useful for the low-level implementation of my brain algorithms. I dunno. Again, I think it’s hard to say either way.
Peter Watts would like you to ponder how Portia spiders think about what they see. :)
Is that link safe to click for someone with arachnophobia?
no pictures
Yes. Photos are a lot of work to include, and anyway, jumping spiders are famously cute (as far as spiders go).
I wish the cuteness made a difference. Interesting reading though, thanks.
EDIT: I updated the circuits section of the article with an improved model of Serial vs Parallel vs Neuromorphic (PIM) scalability, which better illustrates how serial computation doesn’t scale.
Yes, you bring up a good point, and one I should have discussed in more detail (but the article is already pretty long). However, the article does provide part of the framework to answer this question.
There definitely are serial/parallel tradeoffs where the parallel version of an algorithm tends to use marginally more compute asymptotically. However, these simple big-O asymptotic models do not consider the fundamental cost of wire energy for remote memory accesses, which actually scales as M^(1/2) for a 2D memory of size M. So in that sense the simple big-O models are asymptotically wrong. If you use more detailed models which account for the actual wire energy costs, everything changes, and the parallel versions leveraging distributed local memory, and thus avoiding wire energy transit, are generally more energy efficient—but through using a more memory-heavy algorithmic approach.
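Here is a toy model of that point (not the article’s formal model): it charges each remote access a wire cost growing as sqrt(M), with all constants picked arbitrarily just to show how the ranking of algorithms can flip once data movement is priced in.

```python
# Toy energy model: total = ops + remote_accesses * wire_cost + local_accesses,
# where reaching a random cell in a 2D memory of M cells costs ~sqrt(M).
# All constants are arbitrary illustrative assumptions.
import math

def energy(ops, remote_accesses, local_accesses, M, e_per_hop=0.1):
    wire = e_per_hop * math.sqrt(M)     # average wire length ~ sqrt(M) in 2D
    return ops + remote_accesses * wire + local_accesses

M = 1e10  # memory cells
# Op-count-lean algorithm: fewer operations, but every operand comes from distant shared memory.
lean = energy(ops=1e9, remote_accesses=1e9, local_accesses=0, M=M)
# Memory-heavy parallel algorithm: ~10x the operations, but operands sit in local (PIM-style) memory.
fat = energy(ops=1e10, remote_accesses=0, local_accesses=1e10, M=M)
print(f"op-lean/remote: {lean:.1e}   op-heavy/local: {fat:.1e}")
# The big-O "winner" loses by orders of magnitude once sqrt(M) wire energy is charged.
```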
Another way of looking at it is to compare serial-optimized von Neumann processors (CPUs) vs parallel-optimized von Neumann processors (GPUs) vs parallel processor-in-memory (brains, neuromorphic hardware).
Pure serial CPUs (ignoring parallel/vector instructions) with tens of billions of transistors have only on the order of a few dozen cores, but not much higher clock rates than GPUs, despite spending all that die space on marginal serial speed increases—serial speed scales extremely poorly with transistor density, the end of Dennard scaling, etc. A GPU with tens of billions of transistors instead has tens of thousands of ALU cores, but is still ultimately limited by the poor scaling of off-chip RAM bandwidth, proportional to N^0.5 (where N is device area), and by wire energy that doesn’t scale at all. The neuromorphic/PIM machine has perfect memory bandwidth scaling at a 1:1 ratio—it can access all of its RAM per clock cycle, pays near zero energy to access RAM (as memory and compute are unified), and everything scales linearly with N.
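A quick numeric sketch of those scaling claims (compute proportional to N, off-chip bandwidth to N^0.5, PIM bandwidth to N); the units are arbitrary and the exponents simply restate the paragraph above.

```python
# Scaling of compute vs memory bandwidth as device area N grows,
# using the exponents claimed above; absolute units are arbitrary.
for N in (1, 4, 16, 64):
    compute = N              # ALUs scale with area
    offchip_bw = N ** 0.5    # edge/pad-limited DRAM interface
    pim_bw = N               # memory co-located with compute
    print(f"N={N:3d}  compute={compute:4d}  off-chip BW={offchip_bw:5.1f} "
          f"(compute/BW={compute/offchip_bw:4.1f})  PIM BW={pim_bw:4d} "
          f"(compute/BW={compute/pim_bw:.1f})")
# The GPU's compute-to-bandwidth ratio worsens as N^0.5, so it must rely on
# ever more on-chip reuse or batching; the PIM ratio stays constant.
```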
Physics is fundamentally parallel, not serial, so the latter just doesn’t scale.
But of course on top of all that there is latency/delay—so for example the brain is also strongly optimized for minimal depth and thus minimal delay, and to some extent that may compete with optimizing for energy. Ironically, delay is also a problem in GPU ANNs—a huge problem for Tesla’s self-driving cars, for example—because GPUs need to operate on huge batches to amortize their very limited/expensive memory bandwidth.
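The batching point follows from a one-line arithmetic check: with assumed round numbers for peak FLOP/s, DRAM bandwidth, and bytes per weight, the minimum batch needed to keep the ALUs busy falls out directly (these specs are illustrative assumptions, not any particular GPU's datasheet).

```python
# Minimum batch size for a weight-streaming matmul layer to be compute-bound.
# Per batch: compute ~ 2 * P * B FLOPs, weight traffic ~ P * bytes_per_param bytes,
# and the parameter count P cancels out of the inequality.
PEAK_FLOPS = 1e14        # ~100 TFLOP/s (assumed)
MEM_BW = 1e12            # ~1 TB/s DRAM bandwidth (assumed)
BYTES_PER_PARAM = 2      # fp16 weights

min_batch = PEAK_FLOPS * BYTES_PER_PARAM / (2 * MEM_BW)
print(f"need batch >= ~{min_batch:.0f} samples to amortize weight loading")   # ~100
# Assembling (or waiting for) ~100 inputs at a time is precisely the latency
# cost that a processor-in-memory design avoids.
```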
Yeah, latency / depth is the main thing I was thinking of.
If my boss says “You must calculate sin(x) in 2 clock cycles”, I would have no choice but to waste a ton of memory on a giant lookup table. (Maybe “2” is the wrong number of clock cycles here, but you get the idea.) If I’m allowed 10 clock cycles, maybe I can reduce x mod 2π first, and thus use a much smaller lookup table and waste a lot less memory. If I’m allowed 200 clock cycles to calculate sin(x), I can use C code that has no lookup table at all, and thus roughly zero memory and communications. (EDIT: Oops, LOL, the C code I linked uses a lookup table. I could have linked this one instead.)
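Here is a rough Python illustration of the same memory-vs-latency ladder (the cycle counts are notional, these functions are stand-ins for what would really be hardware or tight C, and the table sizes are arbitrary):

```python
# Three ways to get sin(x), trading memory for allowed serial depth.
import math

# "~2 cycles": a single lookup into a huge table spanning [0, 2*pi).
STEPS = 1 << 20
BIG_LUT = [math.sin(2 * math.pi * i / STEPS) for i in range(STEPS)]
def sin_big_lut(x):
    return BIG_LUT[int((x % (2 * math.pi)) / (2 * math.pi) * STEPS) % STEPS]

# "~10 cycles": range-reduce first, then a much smaller table with interpolation.
SMALL = 1 << 10
SMALL_LUT = [math.sin(2 * math.pi * i / SMALL) for i in range(SMALL + 1)]
def sin_small_lut(x):
    t = (x % (2 * math.pi)) / (2 * math.pi) * SMALL
    i = int(t)
    frac = t - i
    return SMALL_LUT[i] * (1 - frac) + SMALL_LUT[i + 1] * frac

# "~200 cycles": no table at all, just serial arithmetic (Taylor series after reduction).
def sin_series(x, terms=10):
    x = (x + math.pi) % (2 * math.pi) - math.pi   # reduce to [-pi, pi)
    total, term = 0.0, x
    for n in range(terms):
        total += term
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total

print(sin_big_lut(5.0), sin_small_lut(5.0), sin_series(5.0), math.sin(5.0))
# Memory footprint goes ~1M entries -> ~1K entries -> zero as the allowed
# serial depth increases.
```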
So I still feel like I don’t want to take it for granted that there’s a certain amount of “algorithmic work” that needs to be done for “intelligence”, and that amount of “work” is similar to what the human brain uses. I feel like there might be potential algorithmic strategies out there that are just out of the question for the human brain, because of serial depth. (Among other reasons.)
Also, it’s not all-or-nothing: I can imagine an AGI that involves a big parallel processor, and a small fast serial coprocessor. Maybe there are little pieces of the algorithm that would massively benefit from serialization, and the brain is bottlenecked in capability (or wastes memory / resources) by the need to find workarounds for those pieces. Or maybe not, who knows.
Fabricating new better chips will be part of a Foom once the AI has nanotech. This might be because humans had already made nanotech by this point, or it might involve using a DNA printer to make nanotech in a day. (The latter requires a substantial amount of intelligence already, so this is a process that probably won’t kick in the moment the AI gets to about human level.)