I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include
Uncontroversial was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.
And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless—because the task itself determines the solution.
Deep learning algorithms trained to predict masked words from large amount of text have recently been shown to generate activations similar to those of the human brain. However, what drives this similarity remains currently unknown. Here, we systematically compare a variety of deep language models to identify the computational principles that lead them to generate brain-like representations of sentences
Here, we report a first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels.
We found a striking correspondence between the layer-by-layer sequence of embeddings from GPT2-XL and the temporal sequence of neural activity in language areas. In addition, we found evidence for the gradual accumulation of recurrent information along the linguistic processing hierarchy. However, we also noticed additional neural processes that took place in the brain, but not in DLMs, during the processing of surprising (unpredictable) words. These findings point to a connection between language processing in humans and DLMs where the layer-by-layer accumulation of contextual information in DLM embeddings matches the temporal dynamics of neural activity in high-order language areas.
Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it.
Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn’t lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence).
In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.
The comparison should between GPT-3 and linguistic-cortex, not the whole brain. For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of flops-equivalent it’s perhaps 1e22 sparse flops for training linguistic cortex (1e13 flops * 1e9 seconds) vs 3e23 flops for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
The comparison should between GPT-3 and linguistic-cortex
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe. You would have to say that the brain’s model is inherently much much more complicated than GPT-3, such that even after putting it in this heavy-on-synapses-lite-on-FLOP format, it still takes much more FLOP to query the brain’s language model than to query GPT-3. And I don’t think that. (Although I suppose this is an area where reasonable people can disagree.)
For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task.
I don’t think energy use is important. For example, if a silicon chip takes 1000× more energy to do the same calculations as a brain, nobody would care. Indeed, I think they’d barely even notice—the electricity costs would still be much less than my local minimum wage. (20 W × 1000 × 10¢/kWh = $2/hr. Maybe a bit more after HVAC and so on.).
I’ve noticed that you bring up energy consumption with some regularity, so I guess you must think that energy efficiency is very important, but I don’t understand why you think that.
For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data…So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
Other than Section 4, this post was about using an AGI, not training it from scratch. If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong. (Is that your argument?)
And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless—because the task itself determines the solution.
Suppose (for the sake of argument—I don’t actually believe this) that human visual cortex is literally a ConvNet. And suppose that human scientists have never had the idea of ConvNets, but they have invented fully-connected feedforward neural nets. So they set about testing the hypothesis that “visual cortex is a fully-connected feedforward neural net”. I suspect that they would find a lot of evidence that apparently confirms this hypothesis, of the same sorts that you describe. For example, similar features would be learned in similar layers. There would be some puzzling discrepancies—especially sample efficiency, and probably also the handling of weird out-of-distribution inputs—but lots of experiments would miss those. So then (in this hypothetical universe) many people would be trumpeting the conclusion: “visual cortex is a fully-connected feedforward DNN”! But they would be wrong! And the careful neuroscientists—the ones who are scrutinizing brain structures, and/or doing experiments more sophisticated than correlating activities in unmanipulated naturalistic data, etc.—would be well aware of that.
There’s a skeptical discussion about specifically LLMs-vs-brains, with some references, in the first part of Section 5 of Bowers et al..
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
In your analysis the brain is using perhaps 1e13 flops/s (which I don’t disagree with much), and if linguistic cortex is 10% of that we get 1e12 flops/s, or 300B flops for 0.3 seconds.
GPT-3 uses all of its nearly 200B parameters for one forward pass, but the flops is probably 2x that (because the attention layers don’t use the long term params), and then you are using a ‘double pass’, so closer to 800B flops for GPT-3. Perhaps the total brain is 1e14 flops/s, so 3T flops for 0.3s of linguistic cortex but regardless its still using roughly the same amount of flops within our uncertainty range.
However as I mentioned earlier a model like GPT-3 running inference on a GPU is much less efficient than this, as many of the matrix mult calls (in training, parallelizing over time) become vector matrix mult calls and thus RAM bandwidth limited.
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe.
The brain’s sparsity obviously reduces the equivalent flop count vs running the same exact model on dense mat mul hardware, but interestingly enough ends up in roughly the same “flops used for inference” regime as the smaller dense model. However it is getting by with perhaps 100x less training data (for the linguistic cortex at least).
If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong.
The scaling laws indicate that performance mostly depends on net training compute, and it doesn’t matter as much as you think how you allocate that between size/params (and thus inference flops) and time/data (training steps). A larger model spends more compute per training step to learn more from less data. GPT-3 used 3e23 flops for training, whereas the linguistic cortex uses perhaps 1e21 to 1e22 (1e13 * 1e9s), but GPT-3 trains on almost 3 OOM more equivalent token data and thus can be much smaller in proportion.
So the brain is more flop efficient, but only because it’s equivalent to a much larger dense model trained on much less data.
LLMs on GPUs are heavily RAM constrained but have plentiful data so they naturally have moved to the (smaller model, trained longer regime) vs the brain. For the brain synapses are fairly cheap, but training data time is not.
I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like “we found brain areas that light up when the person reads ‘cat’, just like how this part of the neural net lights up when given input ‘cat’” and less like “the LLM is useful for other tasks in the same way as the neural version is useful for other tasks”. Am I confused about what the paper says, and if so, how? What sort of claim are you making?
Uncontroversial was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.
And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless—because the task itself determines the solution.
Examples from recent neurosci literature:
From “Brains and algorithms partially converge in natural language processing”:
From “The neural architecture of language: Integrative modeling converges on predictive processing”:
From “Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain”
Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn’t lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence).
The comparison should between GPT-3 and linguistic-cortex, not the whole brain. For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of flops-equivalent it’s perhaps 1e22 sparse flops for training linguistic cortex (1e13 flops * 1e9 seconds) vs 3e23 flops for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe. You would have to say that the brain’s model is inherently much much more complicated than GPT-3, such that even after putting it in this heavy-on-synapses-lite-on-FLOP format, it still takes much more FLOP to query the brain’s language model than to query GPT-3. And I don’t think that. (Although I suppose this is an area where reasonable people can disagree.)
I don’t think energy use is important. For example, if a silicon chip takes 1000× more energy to do the same calculations as a brain, nobody would care. Indeed, I think they’d barely even notice—the electricity costs would still be much less than my local minimum wage. (20 W × 1000 × 10¢/kWh = $2/hr. Maybe a bit more after HVAC and so on.).
I’ve noticed that you bring up energy consumption with some regularity, so I guess you must think that energy efficiency is very important, but I don’t understand why you think that.
Other than Section 4, this post was about using an AGI, not training it from scratch. If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong. (Is that your argument?)
Suppose (for the sake of argument—I don’t actually believe this) that human visual cortex is literally a ConvNet. And suppose that human scientists have never had the idea of ConvNets, but they have invented fully-connected feedforward neural nets. So they set about testing the hypothesis that “visual cortex is a fully-connected feedforward neural net”. I suspect that they would find a lot of evidence that apparently confirms this hypothesis, of the same sorts that you describe. For example, similar features would be learned in similar layers. There would be some puzzling discrepancies—especially sample efficiency, and probably also the handling of weird out-of-distribution inputs—but lots of experiments would miss those. So then (in this hypothetical universe) many people would be trumpeting the conclusion: “visual cortex is a fully-connected feedforward DNN”! But they would be wrong! And the careful neuroscientists—the ones who are scrutinizing brain structures, and/or doing experiments more sophisticated than correlating activities in unmanipulated naturalistic data, etc.—would be well aware of that.
There’s a skeptical discussion about specifically LLMs-vs-brains, with some references, in the first part of Section 5 of Bowers et al..
In your analysis the brain is using perhaps 1e13 flops/s (which I don’t disagree with much), and if linguistic cortex is 10% of that we get 1e12 flops/s, or 300B flops for 0.3 seconds.
GPT-3 uses all of its nearly 200B parameters for one forward pass, but the flops is probably 2x that (because the attention layers don’t use the long term params), and then you are using a ‘double pass’, so closer to 800B flops for GPT-3. Perhaps the total brain is 1e14 flops/s, so 3T flops for 0.3s of linguistic cortex but regardless its still using roughly the same amount of flops within our uncertainty range.
However as I mentioned earlier a model like GPT-3 running inference on a GPU is much less efficient than this, as many of the matrix mult calls (in training, parallelizing over time) become vector matrix mult calls and thus RAM bandwidth limited.
The brain’s sparsity obviously reduces the equivalent flop count vs running the same exact model on dense mat mul hardware, but interestingly enough ends up in roughly the same “flops used for inference” regime as the smaller dense model. However it is getting by with perhaps 100x less training data (for the linguistic cortex at least).
The scaling laws indicate that performance mostly depends on net training compute, and it doesn’t matter as much as you think how you allocate that between size/params (and thus inference flops) and time/data (training steps). A larger model spends more compute per training step to learn more from less data. GPT-3 used 3e23 flops for training, whereas the linguistic cortex uses perhaps 1e21 to 1e22 (1e13 * 1e9s), but GPT-3 trains on almost 3 OOM more equivalent token data and thus can be much smaller in proportion.
So the brain is more flop efficient, but only because it’s equivalent to a much larger dense model trained on much less data.
LLMs on GPUs are heavily RAM constrained but have plentiful data so they naturally have moved to the (smaller model, trained longer regime) vs the brain. For the brain synapses are fairly cheap, but training data time is not.
I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like “we found brain areas that light up when the person reads ‘cat’, just like how this part of the neural net lights up when given input ‘cat’” and less like “the LLM is useful for other tasks in the same way as the neural version is useful for other tasks”. Am I confused about what the paper says, and if so, how? What sort of claim are you making?