I believe this is probably possible, but only with new techniques/breakthroughs. For various mostly economic reasons it simply hasn’t yet attracted the caliber of researcher attention required to make significant progress. It seems very difficult; using a more traditional reliable dense large cluster is much, much easier. And given that the fully decentralized/distributed approach would only be about 5x to 10x cheaper or more efficient even if interconnect were infinite, it’s not really worth investing in until all the other low-hanging research fruit is picked.
Let’s assume we need a brain-sized model, so about 1e14 params. A single modern GPU has the flops (using tensor cores) to run a model this big at real-time speed, assuming a non-trivial breakthrough in exploiting both activation and weight sparsity (nobody has achieved this yet, and it arguably may be impossible, but let’s assume). This means the model runs at 100 Hz, but only 1% of the 1e14 connections are active per timestep, so it needs about 1e14 flop/s instead of 1e16.
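A quick back-of-envelope sketch of that arithmetic (the 1% activation sparsity and roughly one flop per active connection per timestep are the assumptions the numbers imply; everything here is order-of-magnitude):

```python
# Order-of-magnitude flop budget for a brain-sized sparse model.
# Assumes ~1 flop per active connection per timestep (as the text's numbers imply).
params = 1e14            # total connections / weights
rate_hz = 100            # real-time timesteps per second
active_frac = 0.01       # ~1% of connections active per timestep

dense_flops_per_s = params * rate_hz                  # ~1e16 flop/s, dense
sparse_flops_per_s = dense_flops_per_s * active_frac  # ~1e14 flop/s, with sparsity

print(f"dense ~{dense_flops_per_s:.0e} flop/s, sparse ~{sparse_flops_per_s:.0e} flop/s")
```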
Then let’s also assume some simple, effective compression techniques in the spirit of weight sharing allow us to compress the 1e14 params down to about 1e13 bits, or on the order of a terabyte of GPU RAM. This then requires model parallelization over about 64 GPUs, with each GPU simulating about 128 million neurons and the equivalent of a trillion weights, shared/compressed down to use only 16 GB.
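The same arithmetic as a sketch (the ~0.1 bit per weight effective storage after compression/sharing is the assumption; the figures only work out to within a factor of ~2):

```python
# Compressed memory footprint and the resulting degree of model parallelism.
# Assumes ~0.1 bit per weight effective storage after weight sharing/compression.
params = 1e14
bits_per_weight = 0.1
total_bits = params * bits_per_weight      # ~1e13 bits
total_bytes = total_bits / 8               # ~1.25e12 bytes, ~1.25 TB

n_gpus = 64
bytes_per_gpu = total_bytes / n_gpus       # ~2e10 bytes, roughly the quoted 16 GB
neurons_per_gpu = 1e10 / n_gpus            # ~1.6e8, on the order of 128 million
weights_per_gpu = params / n_gpus          # ~1.6e12, "a trillion-ish" weights per GPU
```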
Next let’s assume we constrain this model to use mostly local communication, similar to how the brain is so constrained. So only about 10% of our 10 billion-ish neurons have long-distance connections at all, and the total long-distance (inter-GPU) communication bandwidth is less than 10 Gbps per brain/model instance (about 1 billion long-distance paths firing at 1 Hz). This again assumes brain-like activation sparsity (on the order of a few percent of neurons active per timestep).
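Spelling out that bandwidth estimate; the ~8 bits per spike event (some compact local addressing/routing scheme) is my assumption, chosen so the total lands under the 10 Gbps figure:

```python
# Long-distance (inter-GPU) spike traffic per model instance.
neurons = 1e10
long_range_frac = 0.10         # ~10% of neurons have long-distance connections
avg_rate_hz = 1.0              # sparse firing on those paths, ~1 Hz average
bits_per_event = 8             # assumed compact per-spike routing/address encoding

events_per_s = neurons * long_range_frac * avg_rate_hz   # ~1e9 events/s
bandwidth_bps = events_per_s * bits_per_event            # ~8e9 bit/s, i.e. under 10 Gbps
```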
This means we need up to 64 * 10 Gbps of total inter-GPU bandwidth to simulate our 64 brain-sized instances in parallel, but spread out so each GPU needs only about 10 Gbps. This is easily achievable with 4 or 8 GPUs per machine, fast PCIe or NVLink intra-machine connections, and then a 16- or 32-way 10 Gb network switch connecting the machines.
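The cluster-level budget, under the assumption that the long-distance traffic spreads roughly evenly across GPUs and that most of each GPU’s share stays on intra-machine PCIe/NVLink:

```python
# Aggregate vs per-GPU inter-GPU bandwidth for one 64-GPU cluster.
instances = 64
gbps_per_instance = 10          # long-distance traffic per brain-sized instance
n_gpus = 64
gpus_per_machine = 8            # 4 or 8 per the text

total_gbps = instances * gbps_per_instance    # ~640 Gbps aggregate, cluster-wide
per_gpu_gbps = total_gbps / n_gpus            # ~10 Gbps handled by each GPU
n_machines = n_gpus // gpus_per_machine       # 8 machines (16 if 4 GPUs per machine)
```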
If we can’t achieve 32x weight compression/sharing, then we need to use more GPUs or get by with a smaller model. A setup with up to 8 GPUs per machine and a 64-way switch seems feasible, so we could scale up to 512 local GPUs. But it gets much more expensive/unrealistic past that point.
So to use millions of machines, we’d end up with thousands of these clusters, each cluster running a number of instances of one large brain-sized model. Realistically each cluster has only about a 1 GB/s shared connection to the wider internet, which is extremely limiting: several OOM less than the aggregate bandwidth of the local network switch.
Standard data parallelism would involve passing around our ~1 TB of (already highly compressed) weights/params every gradient step, which seems fairly unrealistic unless truly enormous batch sizes are possible.
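To make the pain concrete, taking the ~1 TB compressed weight footprint and ~1 GB/s uplink from above:

```python
# Cost of naively syncing full weights across clusters over the internet uplink.
weight_bytes = 1e12            # ~1 TB of compressed params per model
uplink_bytes_per_s = 1e9       # ~1 GB/s shared internet connection per cluster

sync_seconds = weight_bytes / uplink_bytes_per_s   # ~1000 s to ship the weights once
print(f"~{sync_seconds:.0f} s per full weight exchange")
```

That’s on the order of fifteen minutes of wall clock per full synchronization, which is why the batch size (or the interval between syncs) would have to be enormous for naive data parallelism to pencil out.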
So the most straightforward approach is probably using these thousands of clusters for parallel hyper-parameter search (which requires barely any bandwidth), but I believe there are new, vastly more efficient techniques waiting to be discovered that use bandwidth somewhere in between (in a log sense) ~1e2 bit/s (distributed hyper-param exploration) and ~1e12 bit/s (full data parallelism at ~1 gradient step per second).