A simple stylized example: imagine you have some algorithm for processing each cluster of inputs from the retina.
You might think that because that algorithm is symmetric* (you want to run the same algorithm regardless of which cluster it is), you only need one copy of the bytecode that represents the compiled algorithm.
This is not the case. Information-wise, sure: there is only one program, taking n bytes of information, so you can save disk space when storing your model.
RAM/cache consumption: each of the parallel processing units you have to use (you will not get real-time results for images if you try to do it serially) must hold its own copy of the algorithm.
And this rule applies throughout the human body: every nerve cluster, audio processing, and so on.
This also greatly expands the memory required over your 24 GB 4090 example. For one thing, the human brain is very sparse, and while Nvidia has managed to improve sparse-network performance, it still requires memory to represent all the sparse values.
I might note that you could have tried to fill in the “cartoon switch” for human synapses. They are likely a MAC for each incoming axon at no better than 8 bits of precision, added to an accumulator for the cell membrane at the synapse with no better than 16 bits of precision. (It’s probably less, but the digital version has to use 16 bits.)
So add up the number of synapses in the human brain, assume 1 kHz, and that’s how many TOPS you need.
Let me do the math for you real quick:
68 billion neurons, about 1,000 connections each, at 1 kHz (it’s very sparse). So we need 68 billion × 1,000 × 1,000 = 6.8e16 ops/s = 68,000 TOPS.
Current gen data center GPU: https://www.nvidia.com/en-us/data-center/h100/
So we would need 17 of them to hit 1 human brain. We can assume you will never get maximum performance (especially given the very high sparsity), so maybe 2-3 nodes with 16 cards each?
Note they would have a total of 3,840 gigabytes of memory (48 cards × 80 GB each).
Since we have 68 billion × 1,000 × (1 byte) = 68 terabytes of weights in a brain, that’s the problem: we only have about 5% as much memory as we need.
This is the reason for neuromorphic compute based on SSDs: compute’s not the bottleneck, memory is.
We can get brain-scale performance with 20 times as much hardware, or 20 × 3 × 16 = 960 cards. At $25k each, that’s $24 million for the GPUs, plus all the motherboards and processors and rack space. Maybe $50 million total?
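The arithmetic above can be sanity-checked in a few lines. All constants are this comment’s estimates, plus two hardware assumptions (roughly 4000 INT8 TOPS and 80 GB per H100-class card); none are measured values.

```python
# Back-of-envelope check of the compute and memory figures above.

NEURONS = 68e9             # assumed neuron count
SYNAPSES_PER_NEURON = 1e3  # assumed average connections per neuron
RATE_HZ = 1e3              # assumed 1 kHz update rate
BYTES_PER_WEIGHT = 1       # assumed 8-bit synapse weights

# Compute: one MAC per synapse per timestep.
ops_per_sec = NEURONS * SYNAPSES_PER_NEURON * RATE_HZ
tops = ops_per_sec / 1e12
print(f"compute needed: {tops:,.0f} TOPS")

CARD_TOPS = 4000           # assumed INT8 throughput of one H100-class card
cards_for_compute = ops_per_sec / (CARD_TOPS * 1e12)
print(f"cards for raw compute: {cards_for_compute:.0f}")

# Memory: one byte per synapse weight.
weights_tb = NEURONS * SYNAPSES_PER_NEURON * BYTES_PER_WEIGHT / 1e12
print(f"weights: {weights_tb:.0f} TB")

CARDS = 48                 # 3 nodes x 16 cards
GB_PER_CARD = 80
memory_tb = CARDS * GB_PER_CARD / 1000
pct = 100 * memory_tb / weights_tb
print(f"memory on {CARDS} cards: {memory_tb:.2f} TB ({pct:.1f}% of needed)")
```

This reproduces the figures used above: 68,000 TOPS, 17 cards for raw compute, 68 TB of weights, and about 5.6% memory coverage on 48 × 80 GB cards.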
That’s a drop in the bucket and easily affordable by current AI companies.
Epistemic notes: I’m a computer engineer (CS master’s, ML) and I work on inference accelerators.
*It’s not symmetric: retinal density varies by position in the eye.
Thanks for your comment! I am not a GPU expert, if you didn’t notice. :)
I might note that you could have tried to fill in the “cartoon switch” for human synapses. They are likely a MAC for each incoming axon…
This is the part I disagree with. For example, in the OP I cited this paper which has no MAC operations, just AND & OR.

More importantly, you’re implicitly assuming that whatever neocortical neurons are doing, the best way to do that same thing on a chip is to have a superficial 1-to-1 mapping between neurons-in-the-brain and virtual-neurons-on-the-chip. I find that unlikely. Back to that paper just above, things happening in the brain are (supposedly) encoded as random sparse subsets of active neurons drawn from a giant pool of neurons. We could do that on the chip, if we wanted to, but we don’t have to! We could assign them serial numbers instead! We can do whatever we want!

Also, cortical neurons are arranged into six layers vertically, and in the other direction, 100 neurons are tied into a closely-interconnected cortical minicolumn, and 100 minicolumns in turn form a cortical column. There’s a lot of structure there! Nobody really knows, but my best guess from what I’ve seen is that a future programmer might have one functional unit in the learning algorithm called a “minicolumn” and it’s doing, umm, whatever it is that minicolumns do, but we don’t need to implement that minicolumn in our code by building it out of 100 different interconnected virtual neurons. Yes the brain builds it that way, but the brain has lots of constraints that we won’t have when we’re writing our own code—for example, a GPU instruction set can do way more things than biological neurons can (partly because biological neurons are so insanely slow that any operation that requires more than a couple serial steps is a nonstarter).
Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even.
There’s a MAC in there. The incoming action potential hits the synapse and sends a certain quantity of neurotransmitter across the gap. The sender cell can vary how much neurotransmitter it sends, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign (like the sign bit and exponent of an 8-bit float).
These 2 variables can be combined into a single coefficient; you can think of it as a “voltage delta” (it can be + or −).
So it’s (1) * (voltage gain) = change in target cell voltage.
For ANN, it’s <activation output> * <weight> = change in target node activation input.
The brain also uses timing to get more information than just “1”: the exact time the pulse arrives matters, up to a certain resolution. It is NOT infinite, for reasons I can explain if you want.
So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage.
Aka you have to multiply 2 numbers together and add, which is what “multiply-accumulate” units do.
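As a sketch, the model described above maps directly onto a dot product. Everything here (the fan-in, the random weights, the threshold) is illustrative, not measured biology:

```python
import numpy as np

rng = np.random.default_rng(0)

N_SYNAPSES = 1000  # assumed fan-in of one cell

# Incoming pulses this timestep: each axon either fired (1) or didn't (0).
spikes = rng.integers(0, 2, N_SYNAPSES).astype(np.float32)

# Per-synapse signed "voltage gain": neurotransmitter quantity times active
# receptor count, with sign set by neurotransmitter type. Quantized to
# ~8 bits to reflect the precision argument above.
gains = rng.integers(-128, 128, N_SYNAPSES).astype(np.float32) / 128.0

# One MAC per incoming axon: accumulate spike * gain into the membrane,
# exactly analogous to an ANN's sum of (activation * weight).
membrane_delta = float(np.dot(spikes, gains))

THRESHOLD = 1.0  # illustrative firing threshold
fires = membrane_delta > THRESHOLD
print(membrane_delta, fires)
```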
Due to all the horrible electrical noise in the brain, plus biological forms of noise and contaminants and other factors, I gave it only 8 bits (1 part in 256) of precision. That’s realistically probably generous; it’s probably not even that good.
There are immense amounts of additional complexity in the brain, but almost none of it matters for determining inference outputs. Action potentials propagate at up to roughly 100 meters per second, so many slower biological processes just don’t matter at all. It’s the same way a transistor’s internal behavior is irrelevant: it’s a cartoon switch.
For training, sure, if we wanted a system to work like a brain we’d have to model some of this, but we don’t. We can train using whatever algorithm is measurably optimal.
Similarly we never have to bother with a “minicolumn”. We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.
We probably will find something way better than a minicolumn. Some argue that’s what a transformer is.
I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P
Similarly we never have to bother with a “minicolumn”. We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.
We probably will find something way better than a minicolumn. Some argue that’s what a transformer is.
I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then it needs to do MAC operations in the same way that (you claim) brain synapses do, and it needs to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?)
There’s a MAC in there.
OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model):
- Dendrites have lots of clusters of 10 nearby synapses.
- Iff all 10 synapses within one cluster get triggered simultaneously, then it triggers a dendritic spike on the downstream neuron.
- Different clusters on the same dendritic tree can each be treated independently. (As background, the whole dendrite doesn’t have a single voltage, let alone the whole dendritic tree. Dendrites have different voltages in different places. If there are multiple synaptic firings that are very close in both time and space, then the voltages can add up and get past the spike threshold; but if multiple synapses that are very far apart from each other fire simultaneously, they don’t add up: they each affect the voltage in their own little area, and no dendritic spike results.)
- The upstream neurons are all firing on a regular clock cycle, such that synapse firings are either “simultaneous” or “so far apart in time that we can treat each timestep independently”.
In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?
Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when in fact spatially-distant synapses can’t collaboratively create a spike.
If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
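Concretely, the AND/OR claim for this toy world can be written down directly; the function and the example cluster layout are purely illustrative:

```python
CLUSTER_SIZE = 10

def dendritic_spike(clusters, active_synapses):
    """AND within each cluster, OR across clusters -- no multiplies.

    clusters: iterable of 10-element tuples of synapse ids.
    active_synapses: set of synapse ids that fired this timestep.
    """
    return any(all(s in active_synapses for s in cluster)
               for cluster in clusters)

# Two illustrative clusters on one dendritic tree.
clusters = [tuple(range(0, 10)), tuple(range(10, 20))]

print(dendritic_spike(clusters, set(range(10))))        # True: cluster 0 fully active
print(dendritic_spike(clusters, set(range(9)) | {15}))  # False: no cluster complete
```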
And by the way, I think we could reformulate this same algorithm to have a very different low-level implementation (but the same input and output), by replacing “groups of neurons that form clusters together” with “serial numbers”. Then there would be no MACs and there would be no multi-synapse ANDs, but rather there would be various hash tables or something, I dunno. And the memory requirements would be different, as would the number of required operations, presumably.
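For what it’s worth, here is one guess at what that reformulation could look like. The inverted index, the function names, and the cluster ids are all hypothetical, just to make the “serial numbers plus hash tables” idea concrete:

```python
from collections import Counter, defaultdict

CLUSTER_SIZE = 10

def build_index(clusters):
    """Invert cluster membership: synapse id -> list of cluster ids."""
    index = defaultdict(list)
    for cluster_id, synapse_ids in clusters.items():
        for s in synapse_ids:
            index[s].append(cluster_id)
    return index

def spiking_clusters(index, active_synapses):
    """Each active synapse bumps a counter for every cluster it belongs to;
    a cluster spikes iff all CLUSTER_SIZE of its synapses were hit.
    Same input/output as the AND/OR formulation, different implementation."""
    hits = Counter()
    for s in active_synapses:
        for cluster_id in index[s]:
            hits[cluster_id] += 1
    return {cid for cid, n in hits.items() if n == CLUSTER_SIZE}

clusters = {"c0": tuple(range(0, 10)), "c1": tuple(range(10, 20))}
index = build_index(clusters)
print(spiking_clusters(index, set(range(10))))  # {'c0'}
print(spiking_clusters(index, set(range(5))))   # set()
```

As the comment suggests, the memory footprint and operation count of this version differ from both the MAC version and the per-cluster AND version, even though the input/output behavior is identical.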
At this point maybe you’re going to reply “OK but that’s an imaginary world, whereas I want to talk about the real world.” Certainly the bullet points above are erasing real-world complexities. But it’s very difficult to judge which real-world complexities are actually playing an important role in brain algorithms and which aren’t. For example, should we treat (certain classes of) cortical synapses as having binary strength rather than smoothly-varying strength? That’s a longstanding controversy! Do neurons really form discrete and completely-noninteracting clusters on dendrites? I doubt it…but maybe the brain would work better if they did!! What about all the other things going on in the cortex? That’s a hard question. There are definitely other things going on unrelated to this particular model, but it’s controversial exactly what they are.
In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?
Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when spatially-distant synapses can’t collaboratively create a spike.
If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
Sure. I was focused on what I thought was the minimum computationally relevant model: we model every effect that matters to whether a synapse will ultimately fire, to a good enough level that it’s within the noise threshold.
OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that?
I think I was just trying to fill in the rest of your cartoon model. No, we probably don’t need exactly that much memory, but I was addressing your misconceptions about repeated algorithms applied across parallel inputs. You do need a copy in memory of every repeat; a 4090 will not cut it.
If you wanted a tighter model, you might ask “OK, how much of the brain is speech processing vs. vision and robotics control?” Then you can estimate how much bigger GPT-3 has to be to also run a robot at human levels of dexterity, and see.
Right now GPT-3 is 175 billion params, or 350 GB in 16-bit, so you need something like 4-8 A100s/H100s to run it. I think above I said 48 cards to hit brain-level compute, and 960 if we end up needing a lot of extra memory, with cards that exist today.
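A quick check of that sizing (80 GB per card is the assumption here; real deployments need extra headroom for activations and overhead):

```python
params = 175e9           # GPT-3 parameter count
bytes_per_param = 2      # 16-bit weights

model_gb = params * bytes_per_param / 1e9
print(f"model size: {model_gb:.0f} GB")  # 350 GB

GB_PER_CARD = 80         # assumed A100/H100 80 GB variant
cards = -(-model_gb // GB_PER_CARD)      # ceiling division
print(f"minimum cards just for weights: {cards:.0f}")  # 5
```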
So you can optimize this down a lot, but probably not to a 4090. One way to optimize is a cognitive architecture made of many specialized networks, where you only load the ones relevant to the current task. The AGI then needs time to “context switch” by loading the set of networks it needs from storage into memory.
Instant switching can also be done at a larger datacenter level with a fairly obvious algorithm.
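To put rough numbers on that context-switch cost (the 40 GB network size and both bandwidth figures are ballpark assumptions, not measurements):

```python
def load_time_s(network_gb, bandwidth_gb_per_s):
    """Time to pull one specialized network into GPU memory."""
    return network_gb / bandwidth_gb_per_s

NETWORK_GB = 40.0  # assumed size of one specialized network

# From local NVMe storage (~7 GB/s): a multi-second stall.
print(f"{load_time_s(NETWORK_GB, 7.0):.1f} s from SSD")
# From a peer node's memory over a fast interconnect (~400 GB/s):
# short enough to look "instant", which is the datacenter-level trick.
print(f"{load_time_s(NETWORK_GB, 400.0):.1f} s over interconnect")
```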