Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with few natural limits on size (whales, elephants) ended up plateauing at brain neuron and synapse counts within the same order of magnitude.
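A rough back-of-envelope of what that claim looks like numerically; the synapse count, the token rate, and the ~20-tokens-per-parameter heuristic below are illustrative assumptions on my part, not figures from this exchange:

```python
# Back-of-envelope: compare a Chinchilla-style "optimal data" estimate for a
# brain-sized model against a generous guess at lifetime sensory data.
# All numbers below are rough, illustrative assumptions.

synapses = 1e14                 # ~100 trillion synapses, treated as "parameters"
tokens_per_param = 20           # Chinchilla compute-optimal rule of thumb
optimal_tokens = synapses * tokens_per_param      # ~2e15 "tokens"

seconds_alive = 30 * 365 * 24 * 3600   # ~30 years of experience, ~1e9 seconds
tokens_per_second = 1e3                # generous guess at useful "tokens" per second
lifetime_tokens = seconds_alive * tokens_per_second   # ~1e12 "tokens"

print(f"Chinchilla-optimal data for this parameter count: {optimal_tokens:.1e}")
print(f"Rough lifetime data:                              {lifetime_tokens:.1e}")
# Under these particular assumptions, lifetime data falls roughly three orders
# of magnitude short of what Chinchilla would call "optimal" for this many
# parameters -- the sense in which the brain is "too big for our data".
```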
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction,
Human intelligence in terms of brain arch priors also plateaus; the Ashkenazi just selected a bit more strongly towards that plateau. Intelligence also has neoteny tradeoffs, resulting in numerous ecological niches within tribes: breeding faster often wins.
Chinchilla scaling already suggests the human brain is too big for our lifetime data
So I haven’t followed any of the relevant discussion closely, apologies if I’m missing something, but:
IIUC Chinchilla here references a paper talking about tradeoffs between how many artificial neurons a network has and how much data you use to train it; adding either of those requires compute, so to get the best performance where do you spend marginal compute? And the paper comes up with a function for optimal neurons-versus-data for a given amount of compute, under the paradigm we’re currently using for LLMs. And you’re applying this function to humans.
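(For reference, the paper is Hoffmann et al. 2022, "Training Compute-Optimal Large Language Models". Its fitted parametric loss, with $N$ the parameter count and $D$ the number of training tokens, is roughly

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28.$$

Minimizing this under a fixed compute budget $C \approx 6ND$ makes the optimal $N$ and $D$ each grow roughly as $C^{0.5}$, which is where the "about 20 training tokens per parameter" rule of thumb comes from. Those constants were fitted to a particular LLM training setup, so carrying them over to brains already requires the kind of conversion assumptions questioned below.)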
If so, a priori this seems like a bizarre connection for a few reasons, any one of which seems sufficient to sink it entirely:
Is the paper general enough to apply to human neural architecture? By default I would have assumed not, even if it’s more general than just current LLMs.
Is the paper general enough to apply to human training? By default I would have assumed not. (We can perhaps consider translating the human visual field to a number of bits and taking a number of snapshots per second and considering those to be training runs, but… is there any principled reason not to instead translate to 2x or 0.5x the number of bits or snapshots per second? And that’s just the amount of data, to say nothing of how the training works.)
It seems you’re saying “at this amount of data, adding more neurons simply doesn’t help” rather than “at this amount of data and neurons, you’d prefer to add more data”. That’s different from my understanding of the paper, but of course it might say that as well as, or instead of, what I think it says.
To be clear, it seems to me that you don’t just need the paper to be giving you a scaling law that can apply to humans, with more human neurons corresponding to more artificial neurons and more human lifetime corresponding to more training data. You also need to know the conversion functions, to say “this (number of human neurons, amount of human lifetime) corresponds to this (number of artificial neurons, amount of training data)” and I’d be surprised if we can pin down the relevant values of either parameter to within an order of magnitude.
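To put a number on how loose that is, here is a toy sensitivity sweep; every range in it is an illustrative guess of mine, not anyone's considered estimate:

```python
# How sensitive is the "brain is over-/under-sized for its data" comparison to
# the conversion assumptions? Sweep two conversion factors over a few orders
# of magnitude. All ranges below are illustrative guesses, nothing more.

synapses = 1e14                 # rough synapse count
seconds_of_experience = 1e9     # ~30 years
tokens_per_param_rule = 20      # Chinchilla-style rule of thumb

for params_per_synapse in (0.01, 1.0, 10.0):
    for tokens_per_second in (1e1, 1e3, 1e5):
        params = synapses * params_per_synapse
        lifetime_tokens = seconds_of_experience * tokens_per_second
        ratio = lifetime_tokens / (tokens_per_param_rule * params)
        print(f"params/synapse={params_per_synapse:<5} tokens/s={tokens_per_second:.0e} "
              f"-> lifetime data / 'optimal' data = {ratio:.1e}")
# The ratio moves across about seven orders of magnitude over this small grid,
# which is the pinning-down problem described above.
```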
...but again, I acknowledge that you know what you’re talking about here much more than I do. And, I don’t really expect to understand if you explain, so you shouldn’t necessarily put much effort into this. But if you think I’m mistaken here, I’d appreciate a few words like “you’re wrong about the comparison I’m drawing” or “you’ve got the right idea but I think the comparison actually does work” or something, and maybe a search term I can use if I do feel like looking into it more.
Thanks for your contribution. I would also appreciate a response from Jake.
Why do you think this?
For my understanding: what is a brain arch?
The architectural design of a brain, which I think of as a prior on the weights, so I sometimes call it the architectural prior. It is encoded in the genome and is the equivalent of the high-level PyTorch code for a deep learning model.
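A minimal toy sketch of that analogy, in case it helps (my own illustration, not code from anyone in this thread): the class definition plays the role of the genome-encoded architectural prior, and the weight values are what training, i.e. lifetime experience, fills in.

```python
import torch
import torch.nn as nn

# The class definition is the "architectural prior": it fixes layer types,
# connectivity, and sizes before any data is seen (analogous to the genome).
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(784, 128)   # fixed wiring pattern
        self.act = nn.ReLU()
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        return self.head(self.act(self.encoder(x)))

# The weights start as a generic random initialization; training ("lifetime
# experience") is what gives them their specific values.
model = TinyNet()
print(sum(p.numel() for p in model.parameters()), "learnable parameters")
```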