Either EY believes that the brain is 6 OOM from the efficiency limits for conventional irreversible computers, in which case he is mistaken; or he agrees with me that the brain is close to the practical limits for conventional computers and was instead specifically talking about reversible computation (an interpretation I find unlikely), in which case he agrees with that component of my argument, with all its implications: his argument for fast foom can no longer easily take advantage of nanotech assemblers for a 6 OOM compute advantage; the brain is actually efficient given its constraints; and this implies by association that brain software is much more efficient as well, since it was produced by exactly the same evolutionary process which he now admits produced fully optimized conventional computational elements over the same time frame; etc.
To be clear, if I understand you correctly, the easier path to getting most of the 6 OOMs is through optical interconnect or superconducting interconnect, not via making the full jump to reversible computation (though that also doesn't seem impossible; moving all of it over seems hard, but you could maybe find some way to get a core computation like matrix multiplies into it. I really haven't thought much about this, though, and this take might be really dumb).
I mean, the easiest solution is just “make it smaller and use active cooling”. The relevant loopholes in Jacob’s argument are in the Density and Temperature section of his Brain Efficiency post.
Jacob is using a temperature formula for blackbody radiators, which is basically just irrelevant to the temperature of realistic compute substrates: brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip). The obvious law to use instead would just be the standard thermal conduction law: heat flow per unit area proportional to temperature gradient.
Jacob’s analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).
Put all that together, and we get:
$$\frac{q}{A} = \frac{C_1 T_S R}{R^2} = \frac{C_2 (T_S - T_E)}{R}$$
… where:
$R$ is radius of the system
$A$ is surface area of thermal contact
$q$ is heat flow out of system
$T_S$ is system temperature
$T_E$ is environment temperature (e.g. blood or heat sink temperature)
$C_1, C_2$ are constants with respect to system size and temperature
(Of course a spherical approximation is not great, but we’re mostly interested in change as all the dimensions scale linearly, so the geometry shouldn’t matter for our purposes.)
First key observation: all the R's cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and at a fixed temperature delta the temperature gradient is doubled, since the same delta now falls across half the distance. So, overall, the equilibrium temperature stays the same as the system scales down.
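Spelling out the algebra implied by the formula above (nothing beyond that formula is assumed): setting heat produced equal to heat conducted away per unit area, R drops out entirely.

```latex
\begin{align*}
\frac{C_1 T_S R}{R^2} &= \frac{C_2 (T_S - T_E)}{R}
  && \text{heat produced = heat conducted away, per unit area} \\
C_1 T_S &= C_2 (T_S - T_E)
  && \text{multiply both sides by } R \\
T_S &= \frac{C_2}{C_2 - C_1}\, T_E
  && \text{solve for } T_S\text{; every } R \text{ has cancelled}
\end{align*}
```

In particular, $T_S \propto T_E$ falls straight out of this, which is the scaling used a couple of paragraphs below.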
So in fact scaling down is plausibly free, for purposes of heat management. (Though I'm not highly confident that would work in practice. In particular, I'm least confident about the temperature gradient scaling with system size, in practice. If that failed, then the temperature delta relative to the environment would scale at worst ~linearly with inverse size, i.e. halving the size would double the temperature delta.)
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don’t even need to go to liquid helium for that.
In terms of scaling, our above formula says that $T_S$ will scale proportionally to $T_E$. Halve the environment temperature, halve the system temperature. And that result I do expect to be pretty robust (for systems near Jacob's interconnect Landauer limit), since it just relies on temperature scaling of the Landauer limit plus heat flow being proportional to temperature delta.
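As a quick numeric sanity check on those claims (a minimal sketch; the ~310 K body temperature and ~77 K liquid nitrogen boiling point are standard figures, the ~2.5 C delta is the one quoted above):

```python
# Numeric sanity check on the cooling claims above.
# Assumed figures: system at ~310 K (body temperature), brain-blood delta ~2.5 K
# (quoted from the meta-analysis), liquid nitrogen coolant at ~77 K.
T_system = 310.0       # K
dT_brain_blood = 2.5   # K
T_LN2 = 77.0           # K

dT_LN2 = T_system - T_LN2
print(f"Delta with LN2 coolant: {dT_LN2:.0f} K")                      # ~233 K
print(f"Ratio vs brain-blood delta: {dT_LN2 / dT_brain_blood:.0f}x")  # ~93x, i.e. roughly 100x

# And per the equilibrium solution T_S = C2/(C2 - C1) * T_E above,
# halving the environment temperature halves the system temperature:
ratio = 2.0  # illustrative value of C2/(C2 - C1)
for T_env in (300.0, 150.0):
    print(f"T_E = {T_env:.0f} K -> T_S = {ratio * T_env:.0f} K")
```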
I mean, the easiest solution is just “make it smaller and use active cooling”.
The brain already uses active liquid cooling of course, so this is just "make it smaller and cool it harder".
I have not had time to investigate your claimed physics on how cooling scales, but I'm skeptical: pumping a working coolant through the compute volume can only extract a limited, roughly constant amount of heat from the volume per unit of coolant flowing per time step (this should be obvious?), and thus the amount of heat that can be removed must scale strictly with the surface area (assuming you've already maxed out the cooling effect per unit coolant).
So reduce radius by 2x and you reduce surface area and thus heat pumped out by 4x, but only reduce heat production via reducing wire length by at most 2x as I described in the article.
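Here is a toy sketch of the scaling disagreement (my framing of both models, in arbitrary units, not something either comment spells out):

```python
# Toy comparison of the two cooling-scaling assumptions, in arbitrary units.
# Heat produced scales ~linearly with R (total wire length), per the article's model.
def heat_produced(R):
    return R

# Conduction-limited removal (the model above): area * gradient, gradient ~ dT / R.
def removable_conduction(R, dT=1.0):
    return R**2 * dT / R

# Flow/area-limited removal (the objection here): fixed heat per unit coolant,
# coolant throughput limited by surface area.
def removable_area_limited(R):
    return R**2

for R in (1.0, 0.5, 0.25):
    margin_cond = removable_conduction(R) / heat_produced(R)    # stays constant as R shrinks
    margin_area = removable_area_limited(R) / heat_produced(R)  # shrinks ~linearly with R
    print(f"R={R:.2f}: conduction margin={margin_cond:.2f}, area-limited margin={margin_area:.2f}")
```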
Active cooling ends up using more energy, as you are probably aware. Moving to a colder environment is of course feasible (and used to some extent by some datacenters), but that hardly gets OOM gains on Earth.
Well, to be clear, there is no easy path to 6 OOM of further energy efficiency improvement. At a pure trend-extrapolation level that is of the same order as the gap between a 286 and an Nvidia RTX 4090, which took 40 years of civilization-level effort. At a circuit theory level, the implied ~1e15/s analog synaptic ops in ~1e-5 J is impossible without full reversible computing: interconnect is only ~90% of the energy cost, not 99.999%, and the minimal analog or digital MAC op consumes far more than 0.1 eV. So not only can such a system not run conventional serial algorithms or even massively parallel algorithms, it has to use fully reversible parallel logic. As with quantum computing, it's still unclear what maps usefully onto that paradigm; I'm reasonably optimistic in the long term, but…
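Rough arithmetic behind the "far more than 0.1 eV" point, taking the figures above at face value (1e15 ops on a 1e-5 J budget; the Landauer comparison at body temperature is an added illustration):

```python
# Rough arithmetic for the energy-per-op claim, using the figures quoted above.
import math

k_B = 1.380649e-23    # J/K, Boltzmann constant
eV = 1.602176634e-19  # J per electronvolt

ops = 1e15            # analog synaptic ops (per second)
budget_J = 1e-5       # joules available for those ops

energy_per_op_eV = (budget_J / ops) / eV          # ~0.06 eV per op
landauer_310K_eV = k_B * 310 * math.log(2) / eV   # ~0.019 eV per bit erased at ~310 K

print(f"Implied budget per op:           {energy_per_op_eV:.3f} eV")
print(f"Landauer bound per bit at 310 K: {landauer_310K_eV:.3f} eV")
# ~0.06 eV per full MAC is only a few bit-erasures' worth of energy, while an irreversible
# analog or digital MAC involves many bit-scale events, hence the claim that this regime
# requires reversible logic.
```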
I'm skeptical that even the implied bit error correction energy costs would make much sense on the surface of the Earth. An advanced quantum or reversible computer's need for minimal noise, and thus low temperature, to maintain coherence or a low error rate is just a symptom of reaching highly perfected states of matter, where any tiny atomic disturbance can be catastrophic and cause a cascade of expensive-to-erase errors. Ironically, such a computer would likely be much larger than the brain: this appears to be one of the current fundamental tradeoffs with most reversible computation, and it's not a simple free lunch (optical computers are absolutely enormous, superconducting circuits are large, reversibility increases area, etc.). At scale such systems would probably only work well off Earth, perhaps far from the sun or buried in places like the dark side of the moon, because they become extremely sensitive to thermal noise, cosmic rays, and any disorder. We are talking about arcilect-level tech in 2048 or something, not anything near term.
So instead I expect we'll have a large population of neuromorphic AGI/uploads well before that.
which implies by association that brain software is much more efficient as it was produced by exactly the same evolutionary process which he now admits produced fully optimized conventional computational elements over the same time frame, etc
I don’t believe this would follow; we actually have much stronger evidence that ought to screen off that sort of prior—simply the relatively large differences in human cognitive abilities.
Evolution optimizes population distributions with multiple equilibria and niches; large diversity in many traits is expected, especially for highly successful species.
Furthermore, what current civilization considers useful cognitive abilities often have costs, namely longer neoteny/training periods, which don't always pay off versus quicker-to-breed strategies.
There seems to be much more diversity in human cognitive performance than there is in human brain energy efficiency. Whether this is due to larger differences in the underlying software (to the extent that this is meaningfully commensurable with differences in hardware), or because smaller differences in that domain result in much larger differences in observable outputs, or both, none of that really takes away from the fact that brain software does not seem to be anywhere near the relevant efficiency frontier, especially since many trade-offs which were operative at an evolutionary scale simply aren't when it comes to software.
Human mind software evolves at cultural speeds, so its recent age isn't comparably relevant. Diversity in human cognitive capabilities results from the combined, often multiplicative, effects of brain hardware differences compounding with unique training datasets/experiences.
It's well known in DL that you can't really compare systems trained on very different datasets, but that is always the case with humans.
The trained model can change at cultural speeds but the human neural network architecture, hyperparameters, and reward functions can’t. (Or do you think those are already close to optimal for doing science & technology or executing complicated projects etc.?)
(Whatever current or future chips we wind up using for AGI, we’ll almost definitely be able to change the architecture, hyperparameters, and reward functions without fabricating new chips. So I count those as “software not hardware”. I’m unsure how you’re defining those terms.)
Relatedly, if we can run a brain-like algorithm on computer chips at all (or (eventually) use synth-bio to grow brains in vats, or whatever), then we can increase the number of cortical columns / number of neurons / whatever to be 3× more than any human, and hence (presumably) we would get an AI that would be dramatically more insightful than any human who has ever existed. Specifically, it could hold a far richer and more complicated thought in working memory, whereas humans would have to chunk it and explore it sequentially, which makes it harder to notice connections / analogies / interactions between the parts. I’m unclear on how you’re thinking about things like that. It seems pretty important on my models.
The trained model can change at cultural speeds but the human neural network architecture, hyperparameters, and reward functions can’t. (Or do you think those are already close to optimal for doing science & technology or executing complicated projects etc.?)
You can't easily change either the ANN or BNN architecture/hyperparams wholesale after training, as the weights you invest so much compute in learning are largely dependent on those decisions; in fact, the architecture is just equivalent to weights. Sure, there are ways to add new modules later or regraft things, but that offers very limited scope for improving the already-trained modules.
As to your second question: no, I don't think there are huge gains over the human brain arch, in part because the initial arch doesn't/shouldn't matter that much. If it does, then it wasn't flexible enough in the first place. One of the key points of my ULM post was that the human brain, unlike current DL systems, learns the architecture during training through high-level sparse wiring patterns. "Architecture" is largely just wiring patterns, and in a huge, flexible network you can learn the architecture.
Relatedly, if we can run a brain-like algorithm on computer chips at all (or (eventually) use synth-bio to grow brains in vats, or whatever), then we can increase the number of cortical columns / number of neurons / whatever to be 3× more than any human,
Sure, but the human brain is already massive and far off chinchilla scaling. It seems much better currently to use your compute/energy budget on running a smaller model much faster (to learn more quickly).
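To gesture at what "far off chinchilla scaling" means quantitatively, here's a very rough sketch; the ~1e14 synapses-as-parameters figure, the 20-tokens-per-parameter Chinchilla rule of thumb, and the token rate of lifetime experience are illustrative assumptions, not figures from this thread:

```python
# Very rough illustration of "far off chinchilla scaling" (illustrative numbers only).
# Assumptions: ~1e14 synapses treated as parameters; Chinchilla-optimal data ~20 tokens
# per parameter; lifetime experience ~1e9 s at (generously) ~10 "tokens" per second.
params = 1e14
chinchilla_tokens = 20 * params     # ~2e15 tokens for compute-optimal training
lifetime_tokens = 1e9 * 10          # ~1e10 token-equivalents of lifetime experience

print(f"Chinchilla-optimal data: ~{chinchilla_tokens:.0e} tokens")
print(f"Lifetime experience:     ~{lifetime_tokens:.0e} token-equivalents")
print(f"Shortfall:               ~{chinchilla_tokens / lifetime_tokens:.0e}x")
# On these crude numbers the brain is heavily over-parameterized relative to its data,
# which is the sense in which a smaller model run much faster could be a better use
# of a fixed compute/energy budget.
```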
Specifically, it could hold a far richer and more complicated thought in working memory, whereas humans would have to chunk it and explore it sequentially,
GPT-4 probably doesn't have that same working memory limitation baked into its architecture, but it doesn't seem to matter much. I guess it's possible it learns that limitation to imitate humans, but regardless I don't see much evidence that the human working memory limitation is all that constraining.
Sure, but the human brain is already massive and far off chinchilla scaling. It seems much better currently to use your compute/energy budget on running a smaller model much faster (to learn more quickly).
I thought your belief was that the human brain is a scaled up chimp brain, right? If so:
If I compare “one human” versus “lots of chimps working together and running at super-speed”, in terms of ability to do science & technology, the former would obviously absolutely crush the latter.
…So by the same token, if I compare “one model that’s like a 3×-scaled-up human brain” to “lots of models that are like normal (non-scaled-up) human brains, working together and running at super-speed”, in terms of ability to do science & technology, it should be at least plausible that the former would absolutely crush the latter, right?
Or if that’s not a good analogy, why not? Thanks.
First, the human brain uses perhaps 10x the net effective training compute of the chimp brain (3x size, 2x from neoteny extending the training of higher modules, a bit from arch changes), and scale alone leads to new capabilities.
But the main new capability was the evolution of language and the resulting cultural revolution. Chimps train on ~1e8 seconds of lifetime data or so, and that's it. Humans train on ~1e9 seconds, but that 1e9 s dataset is a compression of all the experience of all humans who have ever lived. So the effective dataset size scales with human population size rather than being constant, and even a sublinear scaling with population size leads to a radically different regime. The most important inventions driving human civilizational progress directly or indirectly drive up that scaling factor.
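A minimal sketch of the numbers in the two paragraphs above (mostly just restating the figures given; the arch factor is a guess at what "a bit" means):

```python
# Restating the rough chimp -> human scaling figures from the comment above.
brain_size_factor = 3.0    # ~3x larger brain
neoteny_factor = 2.0       # ~2x longer training of higher modules
arch_factor = 1.5          # "a bit from arch changes" (illustrative guess)
effective_compute = brain_size_factor * neoteny_factor * arch_factor
print(f"Effective training compute ratio: ~{effective_compute:.0f}x  (comment says ~10x)")

chimp_lifetime_s = 1e8     # ~1e8 s of lifetime data, and that's it
human_lifetime_s = 1e9     # ~1e9 s, but compressed from all humans who have ever lived
print(f"Raw lifetime data ratio: ~{human_lifetime_s / chimp_lifetime_s:.0f}x")
# The bigger shift is qualitative: the human 1e9 s is drawn from a culturally transmitted
# pool, so effective dataset size scales (sublinearly) with population rather than being constant.
```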
OK, so in your picture chimps had less training / less scale / worse arch than humans, and this is related to the fact that humans have language and chimps don’t. “Scale alone leads to new capabilities.”
But if we explore the regime of “even more training than humans / even more scale than humans / even better arch than humans”, your claim is that this whole regime is just a giant dead zone where nothing interesting happens, and thus you’re just being inefficient—really you should have split it into multiple smaller models. Correct? If so, why do you think that?
In other words, if scaling up from chimp brains to human brains unlocked new capabilities (namely language), why shouldn’t scaling up from human brains to superhuman brains unlock new capabilities too? Do you think there are no capabilities left, or something?
(Sorry if you’ve already talked about this elsewhere.)
OK, so in your picture chimps had less training / less scale / worse arch than humans, and this is related to the fact that humans have language and chimps don’t. “Scale alone leads to new capabilities.”
Scale in compute and data, as per NN scaling laws. The language/culture/tech leading to the new effective-data scaling regime quickly reconfigured the Pareto surface payoff for brain size, so it's more of a feedback loop than a clear cause and effect (which is why I would consider it a foom on evolutionary timescales).
In other words, if scaling up from chimp brains to human brains unlocked new capabilities (namely language), why shouldn’t scaling up from human brains to superhuman brains unlock new capabilities too?
Of course, but the new capabilities are more like new skills, mental programs, and wisdom, not metasystems transitions (changes to the core scaling regime).
A metasystems transition would be something as profound, rare, and important as transitioning from effective lifetime training data being a constant to effective lifetime data scaling with population size, or transitioning from non-programmable to programmable.
Zoom in and look at what a large NN is for: what does it do? It can soak up more data to acquire more knowledge/skills, and it also learns faster per timestep (as it's searching in parallel over a wider circuit space per time step), but the latter is already captured in net training compute anyway. So intelligence is mostly about the volume of search space explored, which scales with net training compute; this is almost an obvious direct consequence of Solomonoff induction or derivations thereof.
I am not arguing that there are no more metasystems transitions, only that “make brains bigger” doesn’t automatically enable them. The single largest impact of digital minds is probably just speed. Not energy efficiency or software efficiency, just raw speed.