tjbai comments on What is the cost difference in processing input vs. output tokens with LLMs?

tjbai 8 Aug 2024 17:08 UTC
1 point
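The cost of output tokens certainly does not scale linearly, even with a KV cache. The KV cache means you don’t need to recompute the key and value vectors for each of the previous tokens (their queries are never needed again), but you still need to compute n query-key dot products for the (n+1)-th token. Per-token attention cost therefore grows with the context length, and the total over a long generation grows quadratically.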
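
To make the counting concrete, here is a rough sketch that tallies query-key dot products for a single attention head in a single layer. The function names and the prompt/output lengths are made up for illustration, and it ignores everything outside attention (MLP blocks, projections), which does scale linearly.

```python
def dot_products_for_next_token(n_cached: int) -> int:
    """Query-key dot products needed for the next token when n_cached keys
    are already in the KV cache (one head, one layer).

    This follows the count in the comment: n dot products for the (n+1)-th
    token. Counting the token's attention to its own key adds one more and
    does not change the scaling.
    """
    return n_cached


def total_dot_products(prompt_len: int, n_output: int) -> int:
    """Total query-key dot products to generate n_output tokens after a
    prompt of prompt_len tokens, assuming the prompt is already cached."""
    return sum(
        dot_products_for_next_token(prompt_len + i) for i in range(n_output)
    )


if __name__ == "__main__":
    # Hypothetical lengths, chosen only to show the trend.
    for n_output in (1_000, 2_000, 4_000):
        print(n_output, total_dot_products(prompt_len=0, n_output=n_output))
```

Under these assumptions the total is prompt_len * n_output + n_output * (n_output - 1) / 2, so doubling the number of output tokens roughly quadruples the attention work once the output is long compared with the prompt.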