Wow, this is a good argument, especially if its assumptions hold.
The ALUs can compute a layer's outputs much faster than the weights and activations can be moved through memory to feed the next layer.
So if the AI is serving only a single user's prompt, the ALUs spend most of their time idle, waiting for data.
But if many users are sending prompts all the time, their prompts can be batched together, and each weight loaded from memory gets reused across the whole batch (assuming the bottleneck is how fast the wires can move data rather than a hard cap on how much they can carry, since a weight only needs to be moved once per batch, not once per user).
So if your AI is extremely popular (e.g., OpenAI's), your ALUs spend far less time idling, and the same GPUs become much more cost-effective.
Compute is much more expensive per query for less popular AIs (plausibly >1000x).
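A back-of-the-envelope sketch of that amortization, with assumed hardware numbers (roughly A100-class: ~312 TFLOP/s of bf16 compute, ~2 TB/s of memory bandwidth; these are illustrative assumptions, not measurements):

```python
# Sketch: ALU utilization vs. batch size for one d x d matmul,
# under assumed, A100-ish hardware numbers.

PEAK_FLOPS = 312e12        # ALU throughput, FLOP/s (assumed, bf16)
MEM_BANDWIDTH = 2e12       # memory bandwidth, bytes/s (assumed)
BYTES_PER_PARAM = 2        # bf16 weights

def alu_utilization(d_model: int, batch: int) -> float:
    """Fraction of time the ALUs do useful work on a d x d matmul.

    The weight matrix is loaded from memory once and reused for every
    prompt in the batch, so FLOPs scale with batch size while bytes
    moved (weight traffic dominates at small batch) stay fixed.
    """
    flops = 2 * batch * d_model * d_model              # multiply-adds
    bytes_moved = BYTES_PER_PARAM * d_model * d_model  # weights only
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / MEM_BANDWIDTH
    return compute_time / max(compute_time, memory_time)

for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: ALU utilization ~ {alu_utilization(8192, b):.1%}")
```

With these assumed numbers the ALUs only stop waiting on memory around batch ~156; at batch 1 they sit idle over 99% of the time, which is where a large per-query cost multiplier for low-traffic models would come from.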