While there are rare examples of deep learning algorithms that can scale in this way, in practice the current relative resource costs of compute-vs-storage-vs-bandwidth don’t work out in favor of this kind of topology.
First problem: Current deep learning systems require very dense networking: a high degree of connectivity between nodes, high bandwidth over those connections, and low latency. It’s possible some tweak or modification to the architecture and algorithm would allow training on compute with slow, high-latency, low-bandwidth connections, but I don’t know of one right now.
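To make the bandwidth point concrete, here’s a back-of-envelope sketch in Python. The model size and link speeds are made-up but roughly representative; the point is just how many orders of magnitude apart a consumer link and a datacenter interconnect are for naive gradient synchronization.

```python
# Back-of-envelope sketch (hypothetical numbers): how long one naive
# data-parallel gradient sync takes over a consumer link vs. a
# datacenter interconnect, for an assumed 1B-parameter model in fp16.

PARAMS = 1_000_000_000                  # assumed model size
BYTES_PER_PARAM = 2                     # fp16 gradients
grad_bytes = PARAMS * BYTES_PER_PARAM   # ~2 GB shipped per worker per step

def sync_seconds(link_bits_per_sec: float) -> float:
    """Time to push one full gradient copy over the given link."""
    return grad_bytes * 8 / link_bits_per_sec

home_link = 100e6   # ~100 Mbit/s consumer uplink (assumed)
dc_link = 400e9     # ~400 Gbit/s datacenter interconnect (assumed)

print(f"consumer link:   {sync_seconds(home_link):8.1f} s per step")
print(f"datacenter link: {sync_seconds(dc_link):8.3f} s per step")
```

Gradient compression or less frequent synchronization shrinks the gap, but you’re starting several orders of magnitude behind.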
Second problem: Current distributed training approaches strongly require almost all of the compute to be available and reliable. If nodes are stochastically coming and going, it’s hard to know where to route data / how long to wait to accumulate gradients before giving up / who has what copy of the latest parameters. This also seems solvable with engineering, but engineering distributed systems to deal with node failures is a headache.
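Here’s a toy sketch of the kind of bookkeeping I mean (not any real framework’s API; the worker count, drop probability, and quorum are invented). Even in this stripped-down version you already have to decide how long to wait, how much wasted work to tolerate, and what to do when nobody reports in.

```python
# Minimal sketch of a parameter server dealing with workers that
# stochastically drop out. Gradients are just random floats here;
# the point is the quorum / give-up logic, not the learning.
import random

NUM_WORKERS = 8
QUORUM = 6          # assumed: stop waiting once this many report in
DROP_PROB = 0.3     # assumed per-step chance a worker goes silent

def training_step(params: float) -> float:
    received = []
    for _ in range(NUM_WORKERS):
        if random.random() < DROP_PROB:
            continue                      # worker vanished; its work is wasted
        received.append(random.gauss(-0.1, 0.05))  # stand-in gradient
        if len(received) >= QUORUM:
            break                         # accept partial results, move on
    if not received:
        return params                     # nobody showed up; skip the step
    avg_grad = sum(received) / len(received)
    return params - 0.1 * avg_grad        # SGD update with lr=0.1

params = 1.0
for step in range(5):
    params = training_step(params)
    print(f"step {step}: params={params:.3f}")
```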
Third problem: Trust in the computing. Deep neural networks are pretty sensitive to data poisoning and adversarial examples. I expect that for almost any efficient distributed training setup, a clever attacker could figure out a way to subtly introduce unnoticed, unwanted changes into the training. In general I think a patient attacker could make changes subtle enough that they’d almost always stay below whatever validation threshold the server applies, but I don’t have a proof of this.
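As a toy illustration (not a proof), here’s one way a simple norm-based validation check could be slipped past: the attacker just scales their malicious direction down until the submitted gradient passes the threshold. All the numbers and the threshold rule are made up.

```python
# Toy illustration: an attacker blends a malicious direction into an
# honest-looking gradient and shrinks it until it passes an assumed
# norm check. A single check like this only bounds the per-step damage.
import numpy as np

rng = np.random.default_rng(0)
DIM = 1000
honest = [rng.normal(0, 1, DIM) for _ in range(9)]
honest_mean = np.mean(honest, axis=0)

malicious_direction = rng.normal(0, 1, DIM)  # whatever the attacker wants

NORM_THRESHOLD = np.linalg.norm(honest_mean) * 1.5  # assumed server check

alpha = 1.0
while True:
    submitted = honest_mean + alpha * malicious_direction
    if np.linalg.norm(submitted) <= NORM_THRESHOLD:
        break       # passes validation; the attack just takes more steps
    alpha *= 0.5    # shrink the malicious component and try again

print(f"attack scale that passes validation: alpha={alpha:.4f}")
print(f"submitted norm {np.linalg.norm(submitted):.2f} "
      f"<= threshold {NORM_THRESHOLD:.2f}")
```

A patient attacker can keep applying a small, passing nudge every step, which is the scenario I’m worried about above.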
Maybe there are other problems that I missed, but at the very least each one of those independently would make me not want to train my large models on a setup like this.