Presumably bandwidth requirements can be reduced a lot through width-wise parallelism.
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of Gen4 PCIe) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of PCIe lanes).
Gen5 and Gen6 PCIe should in theory double this and then double it again, but on a multiyear cadence at best.
Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.
Hence: beware bandwidth bottlenecks.
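To make the bottleneck concrete, here's a back-of-envelope sketch. The ~250GB/s figure is from above; the parameter count (175B) and fp16 storage (2 bytes/param) are my assumptions, not from the comment:

```python
# Back-of-envelope: time to stream GPT-3-sized weights over aggregate PCIe.
params = 175e9          # assumed GPT-3 parameter count
bytes_per_param = 2     # assumed fp16 storage
pcie_bw = 250e9         # ~250 GB/s aggregate Gen4 PCIe (Threadripper Pro figure above)

weights_bytes = params * bytes_per_param       # ~350 GB of weights
seconds_per_pass = weights_bytes / pcie_bw     # time to stream them once
print(f"{weights_bytes / 1e9:.0f} GB of weights, {seconds_per_pass:.2f} s per full stream")
```

So even saturating every lane on the biggest consumer platform, just moving the weights once takes on the order of a second.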
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which contributes its own PCIe bandwidth to the total.
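A toy sketch of what width-wise splitting looks like, using NumPy as a stand-in for real devices (shapes and machine count are illustrative assumptions):

```python
import numpy as np

# Width-wise (tensor) parallelism sketch: split one layer's weight
# matrix column-wise across n_machines. Each machine streams only
# 1/n_machines of the weights, so the PCIe bandwidth needed per
# machine drops proportionally while total bandwidth scales up.
rng = np.random.default_rng(0)
n_machines = 4
x = rng.standard_normal((8, 512))       # activations (batch, d_in)
w = rng.standard_normal((512, 1024))    # full layer weights

shards = np.split(w, n_machines, axis=1)            # one column shard per machine
partial_outputs = [x @ shard for shard in shards]   # computed independently
y = np.concatenate(partial_outputs, axis=1)         # gather the output slices

assert np.allclose(y, x @ w)            # same result as the unsplit layer
print(shards[0].nbytes / w.nbytes)      # each machine holds 1/4 of the weights
```

The gather step does cost some inter-machine communication, but activations are tiny compared to weights, which is why this trades favorably.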
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)