Had it turned out that the brain was big because blind-idiot-god left gains on the table, I’d have considered it evidence of more gains lying on other tables and updated towards faster takeoff.
I mean, sure, but I doubt that e.g. Eliezer thinks evolution is inefficient in that sense.
Basically, there are only a handful of specific ways we should expect to be able to beat evolution in terms of general capabilities, a priori:
1. Some things just haven’t had very much time to evolve, so they’re probably not near optimal. Broca’s area would be an obvious candidate, and more generally whatever things separate human brains from other apes.
2. There are ways to nonlocally redesign the whole system to jump from one local optimum to somewhere else.
3. We’re optimizing against an environment different from the ancestral environment, or against structural constraints different from those faced by biological systems, such that some constraints basically cease to be relevant. The relative abundance of energy is one standard example of a relaxed environmental constraint; the birth canal as a limiting factor on human brain size during development, or the need to make everything out of cells, are standard examples of relaxed structural constraints.
One particularly important sub-case of “different environment”: insofar as the ancestral environment mostly didn’t change very quickly, evolution didn’t necessarily select heavily for very generalizable capabilities. The sphex wasp’s behavior is a standard example. A hypothetical AI designer would presumably design/select for generalization directly.
(I expect that Eliezer would agree with roughly this characterization, by the way. It’s a very similar way-of-thinking to Inadequate Equilibria, just applied to bio rather than econ.) These kinds of loopholes leave ample space to dramatically improve on the human brain.
Interesting. I think I disagree most with 1. The neuroscience seems pretty clear that the human brain is just a scaled-up standard primate brain; the secret sauce is just language (I discuss this now and again in some posts and in my recent part 2). In other words, nothing new about the human brain has had much time to evolve: all evolution did was tweak a few hyperparameters, mostly around size and neoteny (training time). That is very much like GPT-N scaling (which my model predicted).
Basically, human technology beats evolution because we are not constrained to use self-replicating nanobots built out of common, locally available materials for everything. A jet airplane is not something you can easily build out of self-replicating nanobots: it requires too many high-energy construction processes and rare materials spread across the earth.
Microchip fabs and their outputs are the pinnacle of this difference, requiring rare elements from across the periodic table, massively complex global supply chains, and many steps of intricate high-energy construction and refinement processes throughout.
What this ends up buying you mostly is very high energy densities—useful for engines, but also for fast processors.
Yeah, the main changes I’d expect in category 1 are just pushing things further in the directions they’re already moving, and then adjusting whatever else needs to be adjusted to match the new hyperparameter values.
One example is brain size: we know brains have generally grown larger in recent evolutionary history, but they’re locally limited by things like birth canal size. Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction, the various physiological problems those variants can cause need to be offset by other simultaneous changes, which is the sort of thing a designer can do a lot faster than evolution can. (And note that, given how much the Ashkenazi dominated the sciences in their heyday, that’s the sort of change which could by itself produce sufficiently superhuman performance to decisively outperform human science/engineering, if we can go just a few more standard deviations along the same directions.)
… but I do generally expect that the “different environmental/structural constraints” class is still where the most important action is, by a wide margin. In particular, the “selection for generality” part is probably pretty big game, as are the selection pressures around group interaction (like language; note that AI potentially allows for FAR more efficient communication between instances), the need to learn everything from scratch in each instance rather than copying, and generally the ability to integrate quantitatively much more information than was typically relevant or available for local problems in the ancestral environment.
Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with very few natural limits on size (whales, elephants) ended up plateauing at around the same order of magnitude of brain neuron and synapse counts.
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction,
Human intelligence in terms of brain arch priors also plateaus; the Ashkenazi just selected a bit more strongly towards that plateau. Intelligence also has neoteny tradeoffs, resulting in numerous ecological niches in tribes: faster to breed often wins.
Chinchilla scaling already suggests the human brain is too big for our lifetime data
So I haven’t followed any of the relevant discussion closely, apologies if I’m missing something, but:
IIUC Chinchilla here references a paper talking about tradeoffs between how many artificial neurons a network has and how much data you use to train it; adding either of those requires compute, so to get the best performance where do you spend marginal compute? And the paper comes up with a function for optimal neurons-versus-data for a given amount of compute, under the paradigm we’re currently using for LLMs. And you’re applying this function to humans.
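(For concreteness, here is a minimal sketch of the tradeoff I understand that paper to be fitting: a loss model in parameter count N and training tokens D, minimized under a fixed compute budget. The constants are the fitted values I believe Hoffmann et al. 2022 report, and the C ≈ 6·N·D compute rule is itself an approximation; treat all of it as rough.)

```python
import numpy as np

# Hedged sketch of the Chinchilla-style loss model L(N, D) = E + A/N^alpha + B/D^beta,
# where N = parameters, D = training tokens, and compute C ~ 6*N*D FLOPs.
# Constants are the (approximate) fitted values reported in Hoffmann et al. 2022.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, n_grid=2000):
    """Grid-search the parameter count N that minimizes loss for a fixed FLOP budget C."""
    N = np.logspace(6, 13, n_grid)   # candidate parameter counts
    D = C / (6.0 * N)                # tokens implied by spending the rest of the budget on data
    i = np.argmin(loss(N, D))
    return N[i], D[i]

for C in (1e21, 1e23, 1e25):         # example FLOP budgets
    N, D = compute_optimal(C)
    print(f"C={C:.0e}: N ~ {N:.1e} params, D ~ {D:.1e} tokens, D/N ~ {D/N:.0f}")
```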
If so, a priori this seems like a bizarre connection for a few reasons, any one of which seems sufficient to sink it entirely:
Is the paper general enough to apply to human neural architecture? By default I would have assumed not, even if it’s more general than just current LLMs.
Is the paper general enough to apply to human training? By default I would have assumed not. (We can perhaps consider translating the human visual field to a number of bits and taking a number of snapshots per second and considering those to be training runs, but… is there any principled reason not to instead translate to 2x or 0.5x the number of bits or snapshots per second? And that’s just the amount of data, to say nothing of how the training works.)
It seems you’re saying “at this amount of data, adding more neurons simply doesn’t help” rather than “at this amount of data and neurons, you’d prefer to add more data”. That’s different from my understanding of the paper but of course it might say that as well or instead of what I think it says.
To be clear, it seems to me that you don’t just need the paper to be giving you a scaling law that can apply to humans, with more human neurons corresponding to more artificial neurons and more human lifetime corresponding to more training data. You also need to know the conversion functions, to say “this (number of human neurons, amount of human lifetime) corresponds to this (number of artificial neurons, amount of training data)” and I’d be surprised if we can pin down the relevant values of either parameter to within an order of magnitude.
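To illustrate how much those conversion choices swing things, here’s a rough back-of-envelope sketch. All of the proxy numbers below are commonly cited estimates rather than anything I’m confident in, and the choice of proxies is itself the contested part:

```python
# Back-of-envelope: how the implied (parameters, data) for a human swings
# depending on which correspondence you pick. All figures are rough,
# commonly cited estimates (cortical neuron count, total synapse count,
# ~10 Mbit/s optic-nerve estimates), not measurements being vouched for.
SECONDS_AWAKE_30Y = 30 * 365 * 24 * 3600 * (2 / 3)   # ~6e8 s of waking life by age 30

param_proxies = {
    "cortical neurons (~1.6e10)": 1.6e10,
    "total synapses (~1e14)":     1e14,
}
data_rates_per_second = {
    "~1 'token' per second":      1.0,
    "optic-nerve ~1e7 bits/s":    1e7,
}

for p_name, n_params in param_proxies.items():
    for d_name, rate in data_rates_per_second.items():
        d = rate * SECONDS_AWAKE_30Y
        print(f"{p_name:28s} + {d_name:24s} -> N ~ {n_params:.0e}, D ~ {d:.0e}, D/N ~ {d/n_params:.0e}")
```

Depending on the proxies, the implied data-to-“parameter” ratio moves by something like ten orders of magnitude, which is the sort of ambiguity I’m gesturing at.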
...but again, I acknowledge that you know what you’re talking about here much more than I do. And, I don’t really expect to understand if you explain, so you shouldn’t necessarily put much effort into this. But if you think I’m mistaken here, I’d appreciate a few words like “you’re wrong about the comparison I’m drawing” or “you’ve got the right idea but I think the comparison actually does work” or something, and maybe a search term I can use if I do feel like looking into it more.
Thanks for your contribution. I would also appreciate a response from Jake.
Why do you think this?
For my understanding: what is a brain arch?
The architectural design of a brain, which I think of as a prior on the weights; hence I sometimes call it the architectural prior. It is encoded in the genome and is the equivalent of the high-level PyTorch code for a deep learning model.
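A toy illustration of that analogy, in case it’s useful (the module below is made up purely for illustration, not code from anyone’s actual model): the class definition plays the role of the genome-encoded architectural prior, while the weight values that training fills in play the role of within-lifetime learning.

```python
import torch.nn as nn

class TinyBrainAnalogy(nn.Module):  # hypothetical name, purely illustrative
    def __init__(self, d_in=128, d_hidden=256, d_out=10):
        super().__init__()
        # Everything fixed here at construction time (layer types, sizes,
        # connectivity) is the "architectural prior" -- the analogue of what
        # the genome specifies.
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        # The parameter values inside self.net are what an optimizer adjusts --
        # the analogue of within-lifetime learning from experience.
        return self.net(x)
```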