But to process the 1001st input token, you also need the KV entries for all 1000 preceding tokens in memory, forming the cache (though that does happen in one step). And for each new output token, you surely don’t dump the existing KV cache after each generation step, only to reload it and append the extra KV vectors for the last generated token. So isn’t the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that’s where the “more work” comes from?
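Here’s a toy sketch of what I mean (illustrative shapes, random numbers standing in for the real learned projections, single head, no batching): prefill computes K/V for the whole prompt in one parallel pass, while decode appends one K/V pair per step and attends over everything cached so far.

```python
import numpy as np

d = 64            # head dimension (illustrative)
prompt_len = 1000

def kv_for(tokens):
    # Stand-in for the per-token K/V projections; a real model
    # computes these from hidden states with learned weights.
    return np.random.randn(len(tokens), d), np.random.randn(len(tokens), d)

# Prefill: K/V for all 1000 prompt tokens computed in one parallel pass.
prompt = list(range(prompt_len))
K_cache, V_cache = kv_for(prompt)

# Decode: one token at a time. The cache is never dumped and reloaded;
# each step appends a single new K/V row and attends over the whole cache.
for step in range(10):
    k_new, v_new = kv_for([0])            # K/V for the just-generated token
    K_cache = np.concatenate([K_cache, k_new])
    V_cache = np.concatenate([V_cache, v_new])
    q = np.random.randn(d)                # query for the current position
    scores = K_cache @ q / np.sqrt(d)     # attend over ALL cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V_cache               # attention output for this step
```

The asymmetry the sketch makes concrete: prefill is one big parallel pass over the prompt, while every decode step rereads the entire cache just to add a single row.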
Is there any reason why this would imply the output:input token pricing ratio commonly being something like 3:1?
Got it, thanks!
Heh, I actually think it’s answered here.