I agree the blackbody formula doesn’t seem that relevant, but it’s also not clear what relevance Jacob is claiming it has. He does discuss that the brain is actively cooled. So let’s look at the conclusion of the section:
Conclusion: The brain is perhaps 1 to 2 OOM larger than the physical limits for a computer of equivalent power, but is constrained to its somewhat larger than minimal size due in part to thermodynamic cooling considerations.
If the temperature-gradient-scaling works and scaling down is free, this is definitely wrong. But you explicitly flag your low confidence in that scaling, and I’m pretty sure it wouldn’t work.* In which case, if the brain were smaller, you’d need either a hotter brain or a colder environment.
I think that makes the conclusion true (with the caveat that ‘considerations’ are not ‘fundamental limits’).
(My gloss of the section is ‘you could potentially make the brain smaller, but it’s the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table’).
* I can provide some hand-wavy arguments about this if anyone wants.
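To make the ‘hotter brain or colder environment’ point concrete, here is a minimal toy sketch, assuming heat removal scales roughly with surface area times the brain–environment temperature gap (Newton’s law of cooling). The power, heat-transfer coefficient, and radius values below are placeholders I made up, not physiological measurements:

```python
# Toy cooling model: P = h * A * dT (Newton's law of cooling / convective heat transfer).
# All numbers below are illustrative placeholders, not physiological measurements.
import math

def required_temp_difference(power_w, h_w_per_m2_k, radius_m):
    """Temperature difference (K) needed to dump power_w through a sphere's surface."""
    area_m2 = 4 * math.pi * radius_m ** 2
    return power_w / (h_w_per_m2_k * area_m2)

P = 20.0   # watts dissipated (placeholder, roughly brain-scale)
h = 100.0  # effective heat-transfer coefficient, W/(m^2 K) (placeholder)
r = 0.07   # linear size, m (placeholder)

for shrink in (1, 2, 4):  # shrink linear dimensions by this factor
    dT = required_temp_difference(P, h, r / shrink)
    print(f"shrink x{shrink}: required brain-environment gap ~ {dT:.1f} K")
# Halving the linear size quadruples the required temperature gap, so a smaller brain
# needs to run hotter, sit in a colder environment, or be cooled more aggressively (higher h).
```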
My gloss of the section is ‘you could potentially make the brain smaller, but it’s the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table’
I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn’t apply to engineered compute hardware. More generally, the brain is probably efficient relative to lots of constraints which don’t apply to engineered compute hardware. A hypothetical AI designing hardware will have different constraints.
Either Jacob needs to argue that the same limiting constraints carry over (in which case hypothetical AI can’t readily outperform brains), or he does not have a substantive claim about AI being unable to outperform brains. If there’s even just one constraint which is very binding for brains, but totally tractable for engineered hardware, then that opens the door to AI dramatically outperforming brains.
I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn’t apply to engineered compute hardware.
The main constraint at minimal device sizes is the thermodynamic limit for irreversible computers, so the wire energy constraint is dominant there.
However, the power dissipation/cooling ability of a 3D computer only scales with the surface area (d²), whereas the number of compute devices scales with the volume (d³) and interconnect scales somewhere in between.
The point of the temperature/cooling section was just to show that shrinking the brain by a factor of X (if possible given space requirements of wire radius, etc.) would increase surface power density by a factor of X², but would only decrease wire length and energy by X, and would not decrease synapse energy at all.
2D chips scale differently of course: the surface area and heat dissipation both tend to scale with d². Conventional chips are already approaching miniaturization limits and will dissipate too much power at full activity, but that’s a separate investigation. 3D computers like the brain can’t run that hot given any fixed tech ability to remove heat per unit surface area. 2D computers are also obviously worse in many respects, as long-range interconnect bandwidth (to memory) only scales with d rather than the d² of compute, which is basically terrible compared to a 3D system where compute and long-range interconnect scale with d³ and d² respectively.
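A minimal sketch of the scaling claim above, under the assumption that shrinking changes only the geometry (same synaptic events, same energy per event); the function name and the exponent bookkeeping are mine, not from the post:

```python
# Scaling exponents for an idealized computer of linear size d, as I read the argument.
# 3D, volume-filling (brain-like): compute ~ d^3, long-range interconnect ~ d^2,
#   heat removal through the surface ~ d^2.
# 2D chip: compute ~ d^2, heat removal ~ d^2, off-chip bandwidth (to memory) ~ d.

def shrink_3d_brain(x):
    """Relative effect of shrinking a 3D brain-like computer's linear size by factor x,
    holding the computation performed (synaptic events, energy per event) fixed."""
    return {
        "surface_power_density": x ** 2,   # roughly the same power through 1/x^2 the surface
        "wire_length_and_energy": 1 / x,   # wires shorten by x, so wire energy drops by x
        "synapse_energy": 1.0,             # unchanged: same events at the same energy each
    }

print(shrink_3d_brain(2))
# -> {'surface_power_density': 4, 'wire_length_and_energy': 0.5, 'synapse_energy': 1.0}
```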
Had it turned out that the brain was big because blind-idiot-god left gains on the table, I’d have considered it evidence of more gains lying on other tables and updated towards faster takeoff.
I mean, sure, but I doubt that e.g. Eliezer thinks evolution is inefficient in that sense.
Basically, there are only a handful of specific ways we should expect to be able to beat evolution in terms of general capabilities, a priori:
1. Some things just haven’t had very much time to evolve, so they’re probably not near optimal. Broca’s area would be an obvious candidate, and more generally whatever things separate human brains from other apes.
2. There are ways to nonlocally redesign the whole system to jump from one local optimum to somewhere else.
3. We’re optimizing against an environment different from the ancestral environment, or structural constraints different from those faced by biological systems, such that some constraints basically cease to be relevant. The relative abundance of energy is one standard example of a relaxed environmental constraint; the birth canal as a limiting factor on human brain size during development, or the need to make everything out of cells, are standard examples of relaxed structural constraints.
One particularly important sub-case of “different environment”: insofar as the ancestral environment mostly didn’t change very quickly, evolution didn’t necessarily select heavily for very generalizable capabilities. The sphex wasp behavior is a standard example. A hypothetical AI designer would presumably design/select for generalization directly.
(I expect that Eliezer would agree with roughly this characterization, by the way. It’s a very similar way-of-thinking to Inadequate Equilibria, just applied to bio rather than econ.) These kinds of loopholes leave ample space to dramatically improve on the human brain.
Interesting—I think I disagree most with 1. The neuroscience seems pretty clear that the human brain is just a scaled-up standard primate brain; the secret sauce is just language (I discuss this now and again in some posts and in my recent part 2). In other words—nothing new about the human brain has had much time to evolve; all evolution did was tweak a few hyperparams, mostly around size and neoteny (training time): very very much like GPT-N scaling (which my model predicted).
Basically human technology beats evolution because we are not constrained to use self-replicating nanobots built out of common, locally available materials for everything. A jet airplane design is not something you can easily build out of self-replicating nanobots—it requires too many high-energy construction processes and rare materials spread across the earth.
Microchip fabs and their outputs are the pinnacle of this difference—requiring rare elements from across the periodic table, massively complex global supply chains, and many steps of intricate high-energy construction/refinement processes throughout.
What this ends up buying you mostly is very high energy densities—useful for engines, but also for fast processors.
Yeah, the main changes I’d expect in category 1 are just pushing things further in the directions they’re already moving, and then adjusting whatever else needs to be adjusted to match the new hyperparameter values.
One example is brain size: we know brains have generally grown larger in recent evolutionary history, but they’re locally-limited by things like e.g. birth canal size. Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction, the various physiological problems those variants can cause need to be offset by other simultaneous changes, which is the sort of thing a designer can do a lot faster than evolution can. (And note that, given how much the Ashkenazi dominated the sciences in their heyday, that’s the sort of change which could by itself produce sufficiently superhuman performance to decisively outperform human science/engineering, if we can go just a few more standard deviations along the same directions.)
… but I do generally expect that the “different environmental/structural constraints” class is still where the most important action is by a wide margin. In particular, the “selection for generality” part is probably pretty big game, as well as selection pressures for group interaction stuff like language (note that AI potentially allows for FAR more efficient communication between instances), and the need for learning everything from scratch in every instance rather than copying, and generally the ability to integrate quantitatively much more information than was typically relevant or available to local problems in the ancestral environment.
Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with few natural size limits (whales, elephants) ended up plateauing at brain neuron and synapse counts within the same OOM.
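For reference, the Chinchilla result being invoked here, as best I recall it (so treat the exact form and the ~20 tokens-per-parameter figure as approximate), is the compute-optimal tradeoff between parameter count N and training tokens D:

```latex
% Chinchilla-style parametric loss fit (Hoffmann et al., 2022), quoted from memory:
%   N = parameter count, D = training tokens, training compute C \approx 6ND.
\[
  L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% Minimizing L at fixed compute C gives
\[
  N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b}, \qquad a \approx b \approx 0.5,
\]
% i.e. parameters and data should be scaled up roughly in proportion, which at the
% budgets studied works out to very roughly 20 tokens of data per parameter.
% The claim above reads this as: a brain-sized model would want far more "tokens"
% than one human lifetime plausibly provides.
```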
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction,
Human intelligence in terms of brain arch priors also plateaus; the Ashkenazi just selected a bit more strongly towards that plateau. Intelligence also has neoteny tradeoffs, resulting in numerous ecological niches in tribes—faster to breed often wins.
Chinchilla scaling already suggests the human brain is too big for our lifetime data
So I haven’t followed any of the relevant discussion closely, apologies if I’m missing something, but:
IIUC Chinchilla here references a paper talking about tradeoffs between how many artificial neurons a network has and how much data you use to train it; adding either of those requires compute, so to get the best performance where do you spend marginal compute? And the paper comes up with a function for optimal neurons-versus-data for a given amount of compute, under the paradigm we’re currently using for LLMs. And you’re applying this function to humans.
If so, a priori this seems like a bizarre connection for a few reasons, any one of which seems sufficient to sink it entirely:
Is the paper general enough to apply to human neural architecture? By default I would have assumed not, even if it’s more general than just current LLMs.
Is the paper general enough to apply to human training? By default I would have assumed not. (We can perhaps consider translating the human visual field to a number of bits and taking a number of snapshots per second and considering those to be training runs, but… is there any principled reason not to instead translate to 2x or 0.5x the number of bits or snapshots per second? And that’s just the amount of data, to say nothing of how the training works.)
It seems you’re saying “at this amount of data, adding more neurons simply doesn’t help” rather than “at this amount of data and neurons, you’d prefer to add more data”. That’s different from my understanding of the paper but of course it might say that as well or instead of what I think it says.
To be clear, it seems to me that you don’t just need the paper to be giving you a scaling law that can apply to humans, with more human neurons corresponding to more artificial neurons and more human lifetime corresponding to more training data. You also need to know the conversion functions, to say “this (number of human neurons, amount of human lifetime) corresponds to this (number of artificial neurons, amount of training data)” and I’d be surprised if we can pin down the relevant values of either parameter to within an order of magnitude.
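To illustrate how much the verdict depends on those conversion functions, here is a toy back-of-envelope; every constant in it (parameters per synapse or per neuron, effective ‘tokens’ per second, years of experience, the ~20 tokens-per-parameter Chinchilla rule of thumb) is an assumption of mine, plausibly off by an OOM or more, which is exactly the point:

```python
# Toy Chinchilla-style comparison for the human brain. Every constant here is an
# assumption chosen only to illustrate how much the answer swings with the conversions.

SECONDS_OF_EXPERIENCE = 30 * 365 * 24 * 3600   # ~30 years, order-of-magnitude only

def data_vs_chinchilla_optimum(params, tokens_per_param, tokens_per_sec):
    """Ratio of lifetime 'tokens' to the rough Chinchilla-optimal token count (~20/param).
    >1 means data-rich for that size; <1 means 'model too big for the data'."""
    lifetime_tokens = tokens_per_sec * SECONDS_OF_EXPERIENCE
    optimal_tokens = tokens_per_param * params
    return lifetime_tokens / optimal_tokens

# Assumption set A: one parameter per synapse (~1e14), modest effective data rate.
print(data_vs_chinchilla_optimum(params=1e14, tokens_per_param=20, tokens_per_sec=1e3))
# Assumption set B: one parameter per neuron (~1e11), rich effective sensory stream.
print(data_vs_chinchilla_optimum(params=1e11, tokens_per_param=20, tokens_per_sec=1e5))
# The two (not obviously unreasonable) assumption sets differ by ~5 orders of magnitude,
# which is the point: the "too big for our lifetime data" verdict hinges on the conversions.
```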
...but again, I acknowledge that you know what you’re talking about here much more than I do. And, I don’t really expect to understand if you explain, so you shouldn’t necessarily put much effort into this. But if you think I’m mistaken here, I’d appreciate a few words like “you’re wrong about the comparison I’m drawing” or “you’ve got the right idea but I think the comparison actually does work” or something, and maybe a search term I can use if I do feel like looking into it more.
Thanks for your contribution. I would also appreciate a response from Jake.
Why do you think this?
For my understanding: what is a brain arch?
The architectural design of a brain, which I think of as a prior on the weights (so I sometimes call it the architectural prior). It is encoded in the genome and is the equivalent of the high-level PyTorch code for a deep learning model.
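To cash out that analogy in code (my illustration; the class and names below are hypothetical, not from any of Jacob’s posts): the class definition plays the role of the genome/architectural prior, fixing the wiring diagram and layer sizes, while the weight values are what learning fills in.

```python
import torch.nn as nn

class TinyCortexAnalogy(nn.Module):
    """The class definition is the 'architectural prior': it fixes the wiring diagram,
    layer sizes, and nonlinearities (genome-analogue). The weight values are what
    training / lifetime experience fills in."""

    def __init__(self, d_in=128, d_hidden=512, d_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),   # architecture: fixed before any data is seen
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

# model = TinyCortexAnalogy()   # weights start as a random draw from the prior;
# training then moves the weights, while the architecture (the "prior") never changes.
```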