For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
In your analysis the brain uses perhaps 1e13 flops/s (which I don’t disagree with much). If linguistic cortex is 10% of that, we get 1e12 flops/s, or 300B flops for 0.3 seconds.
GPT-3 uses all of its nearly 200B parameters for one forward pass, but the flop count is probably 2x that (because the attention layers don’t use the long term params), and then you are using a ‘double pass’, so closer to 800B flops for GPT-3. Perhaps the total brain is 1e14 flops/s, which gives 3T flops for 0.3s of linguistic cortex, but regardless it’s still using roughly the same amount of flops within our uncertainty range.
However, as I mentioned earlier, a model like GPT-3 running inference on a GPU is much less efficient than this: many of the matrix-matrix mult calls used in training (which parallelize over time) become vector-matrix mult calls at inference, and are thus RAM bandwidth limited.
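To make the arithmetic above easy to check, here is a minimal sketch that uses only the round numbers quoted in this comment. The constant and function names are made up for illustration. With these inputs, GPT-3's double pass comes out at roughly 2.7x the cortex figure for a 1e13 flops/s brain and roughly 0.27x for a 1e14 flops/s brain, so within an order of magnitude either way.

```python
# Back-of-envelope inference comparison, using only the figures quoted above.
GPT3_PARAMS = 2e11       # "nearly 200B" parameters
FLOPS_PER_PARAM = 2      # the "probably 2x" factor
PASSES = 2               # the "double pass"
CORTEX_FRACTION = 0.1    # linguistic cortex taken as ~10% of the brain
WINDOW_S = 0.3           # seconds of cortical processing


def gpt3_double_pass_flops():
    """Flops for two forward passes through GPT-3 under the assumptions above."""
    return GPT3_PARAMS * FLOPS_PER_PARAM * PASSES


def cortex_flops(brain_flops_per_s):
    """Flops spent by linguistic cortex over the 0.3 s window."""
    return brain_flops_per_s * CORTEX_FRACTION * WINDOW_S


gpt3 = gpt3_double_pass_flops()
for brain_rate in (1e13, 1e14):
    cortex = cortex_flops(brain_rate)
    print(f"brain {brain_rate:.0e} flops/s: cortex {cortex:.1e} flops, "
          f"GPT-3 {gpt3:.1e} flops, ratio {gpt3 / cortex:.2f}")
```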
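A rough way to see the bandwidth point is to compare flops per byte moved for a single weight matrix when it is applied to a whole batch of token vectors versus a single vector. The sketch below is a simplification under stated assumptions: one square weight matrix, 2-byte (fp16) values, and a hypothetical helper name; the width 12288 and the batch of 2048 are illustrative choices, not measurements.

```python
def arithmetic_intensity(batch, d, bytes_per_value=2):
    """Flops per byte moved for a (batch x d) @ (d x d) multiply.

    Assumes the weights, inputs and outputs are each moved once and
    that every value takes bytes_per_value bytes (fp16 by default).
    """
    flops = 2 * batch * d * d                       # one multiply + one add per weight per row
    values_moved = d * d + 2 * batch * d            # weights + inputs + outputs
    return flops / (values_moved * bytes_per_value)


D_MODEL = 12288  # illustrative layer width

print("matrix-matrix, 2048 tokens at once:", arithmetic_intensity(2048, D_MODEL))
print("vector-matrix, 1 token at a time:  ", arithmetic_intensity(1, D_MODEL))
```

Under these assumptions the single-vector case does about one flop per byte of weights read, while the batched case does over a thousand. GPUs generally need far more than one flop per byte to keep their arithmetic units busy, which is why the single-vector case ends up limited by memory bandwidth rather than compute.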
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe.
The brain’s sparsity obviously reduces the equivalent flop count vs running the same exact model on dense matmul hardware, but interestingly enough it ends up in roughly the same “flops used for inference” regime as the smaller dense model. However, it is getting by with perhaps 100x less training data (for the linguistic cortex at least).
If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong.
The scaling laws indicate that performance mostly depends on net training compute, and it doesn’t matter as much as you think how you allocate that between size/params (and thus inference flops) and time/data (training steps). A larger model spends more compute per training step to learn more from less data. GPT-3 used 3e23 flops for training, whereas the linguistic cortex uses perhaps 1e21 to 1e22 (1e12 to 1e13 flops/s * 1e9s), but GPT-3 trains on almost 3 OOM more equivalent token data and thus can be much smaller in proportion.
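The training-compute gap can be checked the same way. This sketch reads the 1e21 to 1e22 range as 1e12 to 1e13 flops/s sustained over 1e9 seconds, which is an assumption about how the range is meant; the names are hypothetical. With these inputs GPT-3's training budget is roughly 1.5 to 2.5 orders of magnitude larger than the cortex figure.

```python
import math

GPT3_TRAIN_FLOPS = 3e23   # training compute quoted above
TRAINING_SECONDS = 1e9    # rough lifetime "training time" for the cortex


def cortex_train_flops(flops_per_s):
    """Total flops spent by linguistic cortex over the training period."""
    return flops_per_s * TRAINING_SECONDS


def oom_gap(flops_per_s):
    """Orders of magnitude by which GPT-3's training compute exceeds the cortex's."""
    return math.log10(GPT3_TRAIN_FLOPS / cortex_train_flops(flops_per_s))


for rate in (1e12, 1e13):
    print(f"cortex at {rate:.0e} flops/s: {cortex_train_flops(rate):.0e} flops total, "
          f"GPT-3 ahead by {oom_gap(rate):.1f} OOM")
```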
So the brain is more flop efficient, but only because it’s equivalent to a much larger dense model trained on much less data.
LLMs on GPUs are heavily RAM constrained but have plentiful data, so they have naturally moved to the (smaller model, trained longer) regime compared to the brain. For the brain, synapses are fairly cheap, but training data time is not.