Perhaps what you meant is that latency will be high but this isn’t a problem as long as you have high throughput. That’s basically true for training. But this post is about inference, where latency matters a lot more.
(It depends on the application, of course, but the ZeRO Infinity approach can make your model so slow that you don’t want to interact with it in real time, even at GPT-3 scale.)