With KV caching, it costs almost exactly as many FLOPs to take 100 input tokens and generate 900 output tokens, as to take 900 input tokens and generate 100 output tokens. However, you need a lot more memory/memory bandwidth to process an output token than an input token, because to process an output token you also need to fit the KV cache in memory.
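To make the FLOP-parity claim concrete, here's a toy back-of-the-envelope sketch (made-up 7B-ish constants, not any particular model): with a KV cache, the cost of a token at position i is a fixed dense-matmul term plus an attention term linear in i, so the total depends only on n_in + n_out, not the split.

```python
# Toy sketch (assumed 7B-ish constants) of why total FLOPs depend only
# on the total token count once KV caching is on.

def flops_for_token(position, n_params=7e9, n_layers=32, d_model=4096):
    """Approximate FLOPs to process one token at `position` with a KV
    cache: ~2 FLOPs per parameter for the dense matmuls, plus attention
    against the `position` cached key/value vectors in every layer."""
    dense = 2 * n_params                          # QKV/out/MLP projections
    attn = 2 * 2 * d_model * position * n_layers  # scores + weighted sum
    return dense + attn

def total_flops(n_in, n_out):
    # Every token -- prompt or generated -- attends to everything before
    # it, so the sum depends only on n_in + n_out.
    return sum(flops_for_token(i) for i in range(n_in + n_out))

print(f"{total_flops(100, 900):.4e}")  # 100 in, 900 out
print(f"{total_flops(900, 100):.4e}")  # 900 in, 100 out -> identical
```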
But to process the 1001st input token, you also need to load all 1000 previous tokens' keys and values into memory, forming the cache (though that happens in one step). And for each new output token, you surely don't dump the entire existing KV cache after each generation, only to load it again and append the KV vectors for the last generated token. So isn't the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that's where the "more work" comes from?
Is there any reason why this would imply an output:input token pricing ratio of something like the common 3:1?
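For a rough sense of the scale involved, here is a toy arithmetic sketch (assumed 7B-scale model, fp16, nothing measured) of the bytes that have to move per token in each phase:

```python
# Toy arithmetic (assumed 7B-scale constants, fp16) for bytes streamed
# per token. Decode re-reads the full weights plus the whole KV cache to
# emit ONE token; prefill amortizes a single pass over the weights
# across the entire prompt.

BYTES = 2                                   # fp16
N_PARAMS, N_LAYERS, D_MODEL = 7e9, 32, 4096

def kv_cache_bytes(n_tokens):
    # one K and one V vector of size d_model, per layer, per token
    return 2 * N_LAYERS * D_MODEL * n_tokens * BYTES

weights = N_PARAMS * BYTES                  # ~14 GB of weights

# Prefill: 900 prompt tokens share one pass over the weights.
prefill_per_token = weights / 900

# Decode: every output token re-streams weights + the current cache.
decode_per_token = weights + kv_cache_bytes(1000)

print(f"prefill: ~{prefill_per_token / 1e6:.0f} MB/token")  # ~16 MB
print(f"decode:  ~{decode_per_token / 1e9:.1f} GB/token")   # ~14.5 GB
```

In practice, batching amortizes the decode-time weight reads across many concurrent requests, so the realized gap is far smaller than this single-request worst case; the common ~3:1 price ratio presumably reflects that amortization plus scheduling overheads, not raw bandwidth alone.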
> With KV caching, it costs almost exactly as many FLOPs to take 100 input tokens and generate 900 output tokens, as to take 900 input tokens and generate 100 output tokens. However, you need a lot more memory/memory bandwidth to process an output token than an input token, because to process an output token you also need to fit the KV cache in memory.
Got it, thanks!
> But to process the 1001st input token, you also need to load all 1000 previous tokens' keys and values into memory, forming the cache (though that happens in one step). And for each new output token, you surely don't dump the entire existing KV cache after each generation, only to load it again and append the KV vectors for the last generated token. So isn't the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that's where the "more work" comes from?
> Is there any reason why this would imply an output:input token pricing ratio of something like the common 3:1?
Heh, I actually think it’s answered here.