In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?
Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when spatially-distant synapses can’t collaboratively create a spike.
If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
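To make the contrast concrete, here's a minimal Python sketch of the toy model being described: a dendritic spike fires iff some cluster of 10 co-located synapses is fully active (AND within a cluster, OR across clusters), shown alongside the MAC-style weighted sum it's being contrasted with. The function names and cluster size are illustrative assumptions, not pulled from any specific neuron model.

```python
# Toy illustration of the "no MAC" point above. A dendritic spike fires iff
# some cluster of 10 co-located synapses is fully active: AND within each
# cluster, OR across clusters. Cluster size and names are illustrative.

def dendritic_spike(clusters: list[list[bool]]) -> bool:
    # OR over clusters of an AND within each cluster -- no multiplies needed.
    return any(all(cluster) for cluster in clusters)

def mac_point_neuron(inputs: list[float], weights: list[float], threshold: float) -> bool:
    # The point-neuron alternative: multiply-accumulate over ALL synapses,
    # regardless of where they sit on the dendrite, then threshold the sum.
    return sum(x * w for x, w in zip(inputs, weights)) >= threshold

# One fully active cluster is enough for a spike, even if every other
# cluster is silent; spatially-distant synapses never get summed together.
clusters = [[True] * 10, [False] * 10, [False] * 10]
print(dendritic_spike(clusters))  # True
```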
Sure. I was focused on what I thought was the minimum computationally relevant model. As in, we model every effect that matters to whether a synapse ultimately fires, to a level good enough that it’s within the noise threshold.
OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that?
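(For scale, here's a quick back-of-the-envelope of that range, using only the numbers already in the question; nothing below is a new estimate.)

```python
# Back-of-the-envelope for the "~100-100,000x less than 68 TB" range above.
# The 68 TB figure and the factors come from the question itself.
brain_estimate_tb = 68
for factor in (100, 1_000, 10_000, 100_000):
    gb = brain_estimate_tb * 1000 / factor
    print(f"{factor:>7,}x smaller -> {gb:,.2f} GB")
# 100x smaller is ~680 GB (larger than GPT-3 in fp16); 100,000x smaller is
# ~0.7 GB (a typical vision or speech model) -- that's the spread in question.
```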
I think I was just trying to fill in the rest of your cartoon model. No, we probably don’t need exactly that much memory, but I addressed your misconceptions about repeated algorithms applied across parallel inputs. You do need a copy in memory of every repeat. A 4090 will not cut it.
If you wanted a tighter model, you might ask “OK, how much of the brain is speech processing vs. vision and robotics control?” Then you can estimate how much bigger GPT-3 would have to be to also run a robot at human levels of dexterity, and see where that lands.
Right now GPT-3 is 175 billion params, or 350 gigs in 16-bit. So you need something like 4-8 A100s/H100s to run it. I think above I said 48 cards to hit brain-level compute, and, if we end up needing a lot of extra memory, 960.
With current cards that exist.
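A quick sanity check on those card counts, assuming 80 GB A100/H100-class cards and 2 bytes per parameter; the 68 TB figure is the memory estimate carried over from earlier in the thread, and the 48-card figure was a compute estimate rather than a memory one.

```python
# Sanity check of the card counts above. Assumes 80 GB per A100/H100-class
# card and fp16 (2 bytes/param). The 68 TB figure is the earlier memory
# estimate from this thread, not something derived here.
BYTES_PER_PARAM = 2      # fp16
CARD_MEM_GB = 80         # 80 GB A100/H100 variants

gpt3_params = 175e9
gpt3_gb = gpt3_params * BYTES_PER_PARAM / 1e9
print(f"GPT-3 weights: {gpt3_gb:.0f} GB -> ~{gpt3_gb / CARD_MEM_GB:.1f} cards "
      f"just for the weights (4-8 once you add activations and overhead)")

brain_mem_gb = 68e3      # 68 TB
print(f"68 TB of weights -> ~{brain_mem_gb / CARD_MEM_GB:.0f} cards by memory alone")
# ~850 cards by memory, the same order of magnitude as the 960 quoted above.
```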
So you can optimize this down a lot, but probably not to a 4090. One way to optimize is to have a cognitive architecture made of many specialized networks, and only load the ones relevant for the current task. So the AGI needs time to “context switch” by loading the set of networks it needs from storage to memory.
Instant switching can be done as well at a larger datacenter level with a fairly obvious algorithm.
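As a rough sketch of what that “context switch” could look like, here is an LRU cache of specialized networks; the class, the loader callback, and the eviction policy are my own illustrative choices, not a specification of any particular system.

```python
# Illustrative sketch of "load only the networks relevant to the current
# task". The registry, loader callback, and LRU eviction are assumptions
# made for the example.
from collections import OrderedDict

class NetworkCache:
    """Keeps at most `capacity` specialized networks resident in fast memory."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn            # e.g. reads weights from disk/NVMe
        self.resident = OrderedDict()     # name -> loaded network, LRU order

    def get(self, name: str):
        if name in self.resident:
            self.resident.move_to_end(name)    # already loaded: instant
            return self.resident[name]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least-recently-used
        net = self.load_fn(name)               # the slow "context switch"
        self.resident[name] = net
        return net

# At datacenter scale the same idea gets easier: keep every network resident
# somewhere in the cluster and route requests to the node that already holds
# it, so no load sits on the critical path.
cache = NetworkCache(capacity=2, load_fn=lambda name: f"<weights for {name}>")
cache.get("speech")   # loaded from storage
cache.get("vision")   # loaded from storage
cache.get("speech")   # already resident
cache.get("motor")    # evicts "vision"
```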