Don’t global clock speeds have to go down as die area goes up due to the speed of light constraint?
Yes, if you used one die with one global clock domain, they would. Modern chips don't work that way.
For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that's ~2e9 times larger than the H100's die, which is about 1000 mm^2. The area of such a die would be around 2 km^2, i.e. on the order of a kilometer on a side, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ≈ 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain.
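A rough Python sanity check of that arithmetic (the 1e15 MAC count, the ~1000 mm^2 H100-class die, the ~2e9 scale factor, and the single global clock domain are all taken as given from above; using the die's side length instead of a round 1 km gives the same couple-hundred-kHz order of magnitude):

```python
# Rough sanity check of the single-die, single-clock-domain arithmetic above.
C = 3.0e8                       # speed of light, m/s

h100_area_mm2 = 1000.0          # ~1000 mm^2 for an H100-class die
scale_factor  = 2e9             # ~2e9x more MAC units than one H100 holds

area_mm2 = h100_area_mm2 * scale_factor     # 2e12 mm^2
area_km2 = area_mm2 * 1e-12                 # mm^2 -> km^2
side_km  = area_km2 ** 0.5                  # ~1.4 km on a side

# A global clock edge has to cross roughly the die's linear dimension each cycle.
f_max_hz = C / (side_km * 1e3)

print(f"area ~ {area_km2:.1f} km^2, side ~ {side_km:.2f} km")
print(f"single-clock-domain limit ~ {f_max_hz / 1e3:.0f} kHz")
```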
So actually you would make something like 10,000 dies, with the MAC units spread between them. Clusters of dies each calculate a single layer of the network, and there are many optical interconnects between the modules, using silicon photonics.
I was focused on the actual individual synapse part of it. The key thing to realize is that the MAC is the slowest step: multiplications are complicated and take a lot of silicon, and I am implicitly assuming all other steps are cheap and fast. Note that it's also a pipelined system with enough hardware that there aren't any stalls; the moment a MAC is done, it's processing the next input on the next clock, and so on. It's what hardware engineers would do if they had an unlimited budget for silicon.
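To make the latency-vs-throughput point concrete, here is a toy Python sketch of a fully pipelined, stall-free MAC array; the clock rate, 10-clock pipeline depth, and MAC count are placeholder assumptions for illustration, not claimed specs:

```python
# Toy model of the fully pipelined, stall-free MAC array described above.
# All numbers here are placeholders for illustration, not claimed specs.

clock_hz       = 1.0e9    # hypothetical clock rate
pipeline_depth = 10       # clocks from input to finished activation
num_mac_units  = 1.0e15   # one physical MAC per "synapse", as assumed above

# Latency: how long one input takes to flow through the pipeline.
latency_s = pipeline_depth / clock_hz

# Throughput: with no stalls, every MAC unit accepts a new input every clock,
# so finished MACs come out at the full clock rate regardless of latency.
macs_per_second = num_mac_units * clock_hz

print(f"per-layer latency    ~ {latency_s * 1e9:.0f} ns")
print(f"aggregate throughput ~ {macs_per_second:.1e} MAC/s")
```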
Maybe when the clock frequencies get this low, you're dissipating so little heat that you can go 3D without worrying too much about heating issues, and that buys you something. Still, your argument here doesn't seem that obvious to me, especially since one round trip per clock is extremely optimistic if you're trying to do all MACs at once. Remember that GPT-3 is a sequential model; you can't perform all the ops in one clock because later layers need to know what the earlier layers have computed.
No no no. Each layer runs 1 million times faster. If a human brain needs 10 seconds to think of something, that's enough time for the AI equivalent to compute the activations of 10 million layers, or to make many forward passes over the information, reflect on the updated memory buffer from the prior steps, and so on.
Or the easier way to view it: say a human responds to an event in 300 ms.
The optic nerve conducts at about 13 m/s and averages 45 mm in length, so a signal takes roughly 3.46 milliseconds to reach the visual cortex.
Then for a signal to reach a limb, about 100 milliseconds. So we have 187 milliseconds for neural network layers.
Assume the synapses run at 1 kHz, so that means in theory there can be 187 layers of neural network between the visual cortex and the output nerves going to a limb.
So to run this 1 million times faster, we use hollow-core optical fiber, and the signal needs to travel about 1 meter of cabling in total. That takes about 3.35 nanoseconds, roughly 89 million times faster than the 300 ms human response.
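A quick Python check of those wiring numbers (hollow-core fiber is assumed to propagate at roughly the vacuum speed of light, and the ~89 million figure compares the fiber delay against the full 300 ms human response):

```python
# Quick check of the wiring numbers above. Hollow-core fiber is treated as
# propagating at roughly the vacuum speed of light (in reality ~99% of c).

C = 3.0e8                               # speed of light, m/s

optic_nerve_ms = 0.045 / 13.0 * 1e3     # 45 mm at 13 m/s  -> ~3.46 ms
fiber_ns       = 1.0 / C * 1e9          # 1 m of fiber     -> ~3.3 ns
human_total_ms = 300.0                  # total human response time

print(f"optic nerve delay     ~ {optic_nerve_ms:.2f} ms")
print(f"1 m hollow-core fiber ~ {fiber_ns:.2f} ns")
print(f"fiber vs 300 ms total ~ {human_total_ms * 1e6 / fiber_ns:.1e}x")
```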
OK, can we evaluate a 187-layer neural network 1 million times faster than 187 milliseconds?
Well, if we take 1 clock for the MACs and another 9 clocks of delay to compute the activations (including clock domain syncs), that's 10 clocks per layer, or 1,870 clocks total, so you need 10 GHz for a 1-million-fold speedup.
But wait, we have better wiring: the nerve-conduction delays are now negligible, so we actually have about 299 milliseconds of budget for those 187 neural network layers.
This means we only need 6.25 GHz. Or we may be able to do a layer in fewer than 10 clocks.
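Putting the clock arithmetic in one place, as a sketch using the same figures as above (187 layers, 10 clocks per layer, a 1-million-fold speedup target, and the 187 ms / 299 ms budgets):

```python
# The clock-rate arithmetic above, using the same figures: 187 layers and
# 10 clocks per layer (1 clock for the MAC, 9 for activations / domain syncs).

layers           = 187
clocks_per_layer = 10
total_clocks     = layers * clocks_per_layer     # 1870 clocks end to end

# Target: finish in one millionth of the human time budget.
budget_strict_s  = 187e-3 / 1e6     # the 187 ms "thinking" budget, sped up 1e6x
budget_relaxed_s = 299e-3 / 1e6     # ~299 ms once nerve wiring is replaced

print(f"needed clock (187 ms budget): {total_clocks / budget_strict_s / 1e9:.2f} GHz")
print(f"needed clock (299 ms budget): {total_clocks / budget_relaxed_s / 1e9:.2f} GHz")
```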
Overall I think your comment here is quite speculative. It may or may not be true; I think we'll see. But people shouldn't treat it as if this is obviously something that's feasible to do.
There is nothing speculative or novel here about the overall idea of using ASICs to accelerate compute tasks. The reason it won't be done this way is that you can extract a lot more economic value from your silicon and power consumption with slower systems that share hardware.
I mean, right now there's no value at all in doing this: at current speeds the AI model is just going to go off the rails and be deeply wrong within a few minutes, or get stuck in a loop it doesn't know it's in because of short context. Faster serial speed isn't helpful.
In the future, you have to realize that even if the brain of some robotic machine can respond 1 million times faster, physics does not allow the hardware to move that fast, or the cameras to even register a new frame by accumulating photons that fast. It's not a useful speed. The same applies to R&D tasks: at much lower speedups the task becomes limited by the physical world.