How effectively can decentralized GPUs be used to train very large (>GPT-3) LLMs? I can see a mass mobilization campaign of civilian GPUs, along the lines of a mandatory Folding@home, should Beijing ever get serious about the AI race.
I think it’s pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven’t been following volunteer training super closely, so this take is mostly cached from having looked into it for GPT-Neo, plus occasionally seeing new papers about volunteer training.)
You are going to take an insane efficiency hit from the compute having very low-bandwidth, high-latency interconnect. I think it’s not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a cluster with good interconnects, but this is one of those things people have tried to crack for ages.
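As a rough illustration of where that hit comes from, here is a back-of-envelope sketch of the gradient-exchange cost in naive synchronous data-parallel training. All the figures (20 Mbit/s residential upload, 100 Gbit/s datacenter link, fp16 gradients) are illustrative assumptions, not measurements:

```python
# Back-of-envelope sketch of the gradient-exchange bottleneck in naive
# synchronous data-parallel training. All numbers are illustrative
# assumptions, not measurements.

PARAMS = 175e9                 # GPT-3-scale parameter count
BYTES_PER_GRAD = 2             # fp16 gradients
grad_bytes = PARAMS * BYTES_PER_GRAD    # ~350 GB per full gradient exchange

residential_uplink = 20e6 / 8           # ~20 Mbit/s upload -> bytes/s
datacenter_link    = 100e9 / 8          # ~100 Gbit/s interconnect -> bytes/s

# Time for one worker to ship one full gradient (ignoring ring/tree all-reduce
# tricks, compression, latency, and stragglers -- all of which matter in practice).
t_home = grad_bytes / residential_uplink
t_dc   = grad_bytes / datacenter_link

print(f"per-step gradient transfer at home:       {t_home / 3600:.1f} hours")
print(f"per-step gradient transfer in datacenter: {t_dc:.1f} seconds")
print(f"ratio: ~{t_home / t_dc:,.0f}x")
```

Techniques like gradient compression and infrequent local-SGD synchronization attack exactly this gap, which is part of why the eventual overhead is so hard to pin down.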
Heterogeneous compute (not all the GPUs will be the same model), lower reliability (people turn their computers off more often than datacenters do), and having to be robust against bad actors (people could submit bad gradients), along with other challenges, together add another several-fold overhead.
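For the bad-actor point specifically, the usual mitigation direction is robust gradient aggregation. A minimal sketch of one such rule (coordinate-wise trimmed mean), purely to illustrate the idea rather than what any particular volunteer-training project actually does:

```python
import numpy as np

def trimmed_mean_aggregate(gradients: np.ndarray, trim_frac: float = 0.1) -> np.ndarray:
    """Coordinate-wise trimmed mean over per-worker gradients.

    gradients: array of shape (num_workers, num_params), one row per worker.
    trim_frac: fraction of the most extreme values dropped at each end,
               per coordinate, before averaging.
    """
    num_workers = gradients.shape[0]
    k = int(num_workers * trim_frac)           # outliers to drop per side
    sorted_grads = np.sort(gradients, axis=0)  # sort each coordinate across workers
    kept = sorted_grads[k:num_workers - k]     # discard the k largest and k smallest
    return kept.mean(axis=0)

# Toy example: 9 honest workers plus one worker submitting a huge bogus gradient.
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=(9, 4))
malicious = np.full((1, 4), 1e6)
agg = trimmed_mean_aggregate(np.vstack([honest, malicious]), trim_frac=0.1)
print(agg)  # stays near the honest mean instead of being dragged to ~1e5
```

Even simple rules like this cost extra redundancy and verification work, which feeds into the several-fold overhead mentioned above.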
There just isn’t that much volunteer hardware out there. For a rough OOM, the publicly announced Facebook cluster is roughly the same size as the raw size of folding@home at its peak. All in all, I think you would need to do some serious engineering and research to get 1% efficiency at Facebook-cluster scale.
(folding@home is embarrassingly parallel because it requires basically no interconnect, and therefore it also doesn’t mind heterogeneous compute or poor reliability)
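To make the OOM comparison above concrete, a toy calculation: the raw volunteer figure below is an illustrative assumption (folding@home’s widely reported ~2.4 exaFLOPS peak), the cluster is taken to be the same raw size per the comparison above, the 1% figure is the efficiency estimate from this comment, and the 40% cluster utilization is an assumed value, not a measurement:

```python
# Toy arithmetic behind the OOM comparison above. All figures are
# illustrative assumptions, not measurements.

raw_volunteer_flops = 2.4e18        # assumed raw peak of the volunteer fleet
dedicated_cluster_flops = 2.4e18    # per the comment: roughly the same raw size
volunteer_efficiency = 0.01         # the "1% efficiency" estimate above
cluster_efficiency = 0.40           # assumed utilization of a well-tuned dedicated cluster

effective_volunteer = raw_volunteer_flops * volunteer_efficiency
effective_cluster = dedicated_cluster_flops * cluster_efficiency

print(f"effective volunteer throughput: {effective_volunteer:.1e} FLOP/s")
print(f"effective cluster throughput:   {effective_cluster:.1e} FLOP/s")
print(f"gap: ~{effective_cluster / effective_volunteer:.0f}x")
```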
Could you link me to sources that could give me an estimate of how inefficient volunteer compute would be? Is it something like 100x or 10^6x? Mandatory (i.e. integrated into WeChat) volunteer compute (with compensation) available in China could well exceed conventional AI training clusters by several OOMs.