Beware bandwidth bottlenecks, as I mentioned in my original post.
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU then only has to load one slice of the model. Of course you'll need more GPUs, but still not a crazy number as long as you use something like ZeRO-Infinity.
(Yes, 8x GPU-to-GPU communications will hurt overall latency… but not by all that much, I don't think. 1 second is an eternity.)
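Here's a rough sketch of what I mean by "one slice per GPU", with NumPy standing in for the per-GPU shards (the layer sizes and the 8-way split are made-up numbers, not anything real):

```python
import numpy as np

# Toy width-wise split of one linear layer, Y = X @ W, across 8 hypothetical GPUs.
batch, d_in, d_out, n_gpus = 4, 1024, 4096, 8

rng = np.random.default_rng(0)
X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Each "GPU" holds (and has to load) only its own column slice of W.
W_shards = np.split(W, n_gpus, axis=1)      # 8 slices of shape (1024, 512)

# Every GPU sees the same activations X and computes its slice of the output...
Y_shards = [X @ W_i for W_i in W_shards]

# ...and the full output is just the concatenation (an all-gather in practice).
Y = np.concatenate(Y_shards, axis=1)
assert np.allclose(Y, X @ W)

# Per-GPU weight traffic is 1/8 of the full layer; the price is that the
# output gather now has to cross between GPUs/machines.
print(W_shards[0].nbytes / W.nbytes)        # 0.125
```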
Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
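To put very rough numbers on that scaling (everything here is an illustrative assumption, not a measurement):

```python
# Crude model: the activations gathered across GPUs per layer grow with
# batch * seq_len * hidden, so the communication time does too.
def comms_seconds(batch, seq_len, hidden, n_layers,
                  bytes_per_act=2,           # fp16 activations (assumption)
                  gathers_per_layer=2,       # roughly one for attention, one for the MLP
                  link_bytes_per_s=250e9):   # assume ~250 GB/s aggregate interconnect
    volume = batch * seq_len * hidden * bytes_per_act * gathers_per_layer * n_layers
    return volume / link_bytes_per_s

# GPT-3-ish shape: hidden size 12288, 96 layers, 2048-token context.
print(comms_seconds(batch=1,   seq_len=2048, hidden=12288, n_layers=96))  # ~0.04 s
print(comms_seconds(batch=512, seq_len=2048, hidden=12288, n_layers=96))  # ~20 s
print(comms_seconds(batch=1,   seq_len=2048, hidden=24576, n_layers=96))  # ~0.08 s: doubles with width
```

So small-batch inference is comparatively cheap, and big-batch training is where the communication bill really comes due.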
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism.
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of Gen4 PCIe) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of PCIe lanes).
Gen5 and Gen6 PCIe will in theory double this and then double it again, but on a multiyear cadence at best.
Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.
Hence: beware bandwidth bottlenecks.
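For concreteness, the arithmetic behind that worry, using the figures above:

```python
# If the weights don't fit in GPU memory and have to cross PCIe on every
# forward pass, that traffic alone puts a floor on latency.
model_bytes      = 300e9   # ~300 GB for GPT-3, compressed (figure above)
pcie_bytes_per_s = 250e9   # ~128 lanes of Gen4 PCIe on a Threadripper Pro box

print(model_bytes / pcie_bytes_per_s)   # 1.2 seconds per pass, before any compute
```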
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-Infinity.)
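To make that concrete (per-box PCIe figure taken from your comment, machine counts arbitrary):

```python
# Each extra machine brings its own PCIe lanes, so with width-wise sharding
# each box only streams its 1/n slice of every layer over its own links.
model_bytes  = 300e9
pcie_per_box = 250e9      # ~250 GB/s of PCIe per machine (your figure)

for n_machines in (1, 2, 4, 8):
    t = (model_bytes / n_machines) / pcie_per_box
    print(n_machines, round(t, 3))    # 1.2 s, 0.6 s, 0.3 s, 0.15 s
```

(That ignores the extra GPU-to-GPU activation traffic, per my earlier parenthetical, but the weight-streaming side of the bottleneck scales out with machine count.)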