+1 to Cannell’s answer, and I’ll also add pipelining.
Let’s say (one instance of) the system is distributed across 10 GPUs, arranged in series—to do a forward pass, the first GPU does some stuff, passes its result to the second GPU, which passes to the third, etc. If only one user at a time were being serviced, then 90% of those GPUs would be idle at any given time. But pipelining means that, once the first GPU in line has finished one request (or, realistically, batch of requests), it can immediately start on another batch of requests.
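To make the arithmetic concrete, here’s a toy sketch with made-up numbers (10 stages at 10 ms per stage—these are hypothetical, not real serving numbers) showing how pipelining changes throughput without changing latency:

```python
# Toy pipelining illustration. Numbers are hypothetical.
num_stages = 10              # GPUs arranged in series
stage_time_ms = 10           # time for one GPU to process one batch
latency_ms = num_stages * stage_time_ms   # 100 ms for one batch end-to-end

# Unpipelined: only one batch in flight, so a batch completes every 100 ms
# and 9 of the 10 GPUs sit idle at any moment.
unpipelined_throughput = 1000 / latency_ms       # 10 batches/sec

# Pipelined: once the pipeline is full, a batch exits every stage_time_ms,
# even though each individual batch still takes 100 ms end-to-end.
pipelined_throughput = 1000 / stage_time_ms      # 100 batches/sec

print(unpipelined_throughput, pipelined_throughput)  # 10.0 100.0
```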
More generally: the rough estimate in the post above tries to estimate throughput from latency, which doesn’t really work. Parallelism/pipelining mean that latency isn’t a good way to measure throughput, unless we also know how many requests are processed in parallel at a time.
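Concretely, the relationship is roughly Little’s law: throughput ≈ (requests in flight) / latency. So the same latency is consistent with wildly different throughputs depending on concurrency. A quick sketch, again with made-up numbers:

```python
# Little's law sketch (assumes steady state). Numbers are hypothetical.
latency_s = 0.1   # 100 ms per request

for in_flight in (1, 10, 100):
    throughput = in_flight / latency_s
    print(f"{in_flight} requests in flight -> {throughput:.0f} requests/sec")
```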
(Also I have been operating under the assumption that OpenAI is not profitable at-the-margin, and I’m curious to see an estimate.)