(Not an expert.) (Sorry if you answered this and I missed it.)
Let’s say a near-future high-end GPU can run as many ops/s as a human brain but has 300× less memory (RAM). Your suggestion (as I understand it) would be a small supercomputer (university cluster scale?) with 300 GPUs running (at each moment) 300 clones of one AGI at 1× human-brain-speed thinking 300 different thoughts in parallel, but getting repeatedly collapsed (somehow) into a single working memory state.
(If so, I’m not sure that you’d be getting much more out of the 300 thoughts at a time than you’d get from 1 thought at a time. One working memory state seems like a giant constraint!)
Wouldn’t it make more sense to use the same 300 GPUs to have just one human-brain-scale AGI, thinking one thought at a time, but with 300× speedup compared to humans? I know that speedup is limited by latency (both RAM --> ALU and chip --> chip) but I’m not sure what the ceiling is there. (After all, 300× faster than the brain is still insanely slow by some silicon metrics.) I imagine each chip being analogous to a contiguous 1/300th of the brain, and then evidence from the brain is that we can get by with most connections being within-chip, which helps with the chip --> chip latency at least. (I have a couple back-of-the-envelope calculations related to that topic in §6.2 here.)
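(To make the latency worry above concrete, here is a minimal back-of-the-envelope sketch in Python. All numbers in it, the ~10 ms "step" time, the per-hop latency, and the hop count, are my own illustrative assumptions, not figures from the comment or from §6.2.)

```python
# Rough latency-ceiling sketch for the "one brain at 300x" idea.
# All numbers are illustrative assumptions, not from the linked §6.2.

brain_step = 10e-3        # assume ~10 ms per "cortical step" at 1x human speed
target_speedup = 300
budget_per_step = brain_step / target_speedup      # ~33 us per step at 300x

chip_hop_latency = 2e-6   # assume ~2 us per chip-to-chip hop
sequential_hops = 4       # assumed sequential cross-chip dependencies per step

comm_time = sequential_hops * chip_hop_latency     # ~8 us of the budget
compute_time = budget_per_step - comm_time         # what's left for the ALUs

print(f"budget per step at {target_speedup}x: {budget_per_step * 1e6:.0f} us")
print(f"cross-chip latency eats {comm_time * 1e6:.0f} us, "
      f"leaving {compute_time * 1e6:.0f} us for compute")
```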
The problem is that, due to the VN (von Neumann) bottleneck, those 300 GPUs only reach that performance by parallelizing ~1000x over some problem dimension (i.e. doing matrix-matrix multiplication, so that each weight fetched from RAM gets reused many times). They can't just do the obvious thing you'd want, which is to simulate a single large brain (a sparse RNN) at high speed using vector-matrix multiplication: there each weight is fetched for a single use, so the computation is memory-bandwidth bound. Trying that, you'd get at best 1 brain at real-time speed (a ~1000x inefficiency/waste/slowdown). It's not really an interconnect issue per se; it's the VN bottleneck.
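(A rough roofline-style sketch of the same point, i.e. why a single-brain vector-matrix simulation is memory-bandwidth bound while matrix-matrix work can saturate the ALUs. The FLOP/s and bandwidth figures are illustrative assumptions about a near-future GPU, not numbers from the post, and the exact required parallelism factor depends on them.)

```python
# Roofline-style back-of-the-envelope (illustrative numbers, not from the post).
# A GPU only hits its peak FLOP/s if each byte fetched from RAM is reused enough.

peak_flops = 300e12       # assume ~300 TFLOP/s of dense matmul throughput
mem_bandwidth = 1.5e12    # assume ~1.5 TB/s of HBM bandwidth

# Arithmetic intensity (FLOPs per byte) needed to be compute-bound rather than
# memory-bound:
required_intensity = peak_flops / mem_bandwidth          # = 200 FLOPs/byte

# Vector-matrix multiply (one sparse-RNN "brain" step): each 2-byte weight is
# streamed from RAM and used for exactly 2 FLOPs (multiply + accumulate).
vec_mat_intensity = 2 / 2                                # = 1 FLOP/byte

# Matrix-matrix multiply: each streamed weight is reused across B inputs
# (B = batch size, number of clones, sequence length, ...).
def matmul_intensity(B):
    return 2 * B / 2                                     # = B FLOPs/byte

print(f"parallelism needed to saturate ALUs: "
      f"~{required_intensity / vec_mat_intensity:.0f}x")
for B in (1, 32, 256, 1024):
    utilization = min(1.0, matmul_intensity(B) / required_intensity)
    print(f"  reuse factor {B:4d}: ~{utilization:.1%} of peak")
```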
So you have to sort of pick your poison (a minimal sketch of the options follows the list):

- Parallelize over the spatial dimension (CNNs): too constraining for higher brain regions.
- Parallelize over the batch/agent dimension: costly in RAM for each agent's medium-term memory, unless compressed somehow.
- Parallelize over the time dimension (transformers): does enable a huge speedup while being RAM-efficient, but is also highly constraining because it limits recursion.
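(Here is the sketch referenced above: a minimal NumPy illustration of how the batch/agent and time options turn the memory-bound vector-matrix step back into weight-reusing matrix-matrix work, and what each one costs. The sizes are arbitrary illustrative choices.)

```python
import numpy as np

d = 4096                                          # hidden size (illustrative)
W = np.random.randn(d, d).astype(np.float32)      # shared weights streamed from RAM

# What you'd want (memory-bound): one brain, one recurrent step at a time.
h = np.random.randn(d).astype(np.float32)
h_next = W @ h            # vector-matrix: each weight fetched for a single use

# Spatial option (CNN-style) not shown: the same small kernel is reused at every
# spatial position, which is great for reuse but constrains the connectivity.

# Batch/agent option: one weight fetch serves B agents, but B copies of the
# recurrent state (each agent's medium-term memory) must sit in RAM.
B = 256
H = np.random.randn(B, d).astype(np.float32)      # B agent states
H_next = H @ W.T          # matrix-matrix: each weight reused B times

# Time option (transformer-style): process all T positions of a sequence at
# once; no step-to-step recurrent state, but recursion is limited to the
# layer depth rather than being unbounded over time.
T = 512
X = np.random.randn(T, d).astype(np.float32)      # T timesteps
X_next = X @ W.T          # each weight reused T times
```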
The largest advances in DL (the CNN revolution, the transformer revolution) are actually mostly about navigating this VN bottleneck, because more efficient use of GPUs trumps other considerations.