But to process the 1001st input token, you also need the KV entries for all 1000 preceding tokens in memory, forming the cache (though that does happen in one step). And for each new output token, you surely don’t dump the existing KV cache after each generation step, only to reload it and append the extra KV vectors for the last generated token. So isn’t the extra work for output tokens just that the KV cache is accessed, generated, and expanded one token at a time, and that’s where the “more work” comes from?
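Here’s a toy sketch of what I mean (illustrative shapes, random numbers standing in for the real learned projections, single head, no batching): prefill computes K/V for the whole prompt in one parallel pass, while decode appends one K/V pair per step and attends over everything cached so far.

```python
import numpy as np

d = 64            # head dimension (illustrative)
prompt_len = 1000

def kv_for(tokens):
    # Stand-in for the per-token K/V projections; a real model
    # computes these from hidden states with learned weights.
    return np.random.randn(len(tokens), d), np.random.randn(len(tokens), d)

# Prefill: K/V for all 1000 prompt tokens computed in one parallel pass.
prompt = list(range(prompt_len))
K_cache, V_cache = kv_for(prompt)

# Decode: one token at a time. The cache is never dumped and reloaded;
# each step appends a single new K/V row and attends over the whole cache.
for step in range(10):
    k_new, v_new = kv_for([0])            # K/V for the just-generated token
    K_cache = np.concatenate([K_cache, k_new])
    V_cache = np.concatenate([V_cache, v_new])
    q = np.random.randn(d)                # query for the current position
    scores = K_cache @ q / np.sqrt(d)     # attend over ALL cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V_cache               # attention output for this step
```

The asymmetry the sketch makes concrete: prefill is one big parallel pass over the prompt, while every decode step rereads the entire cache just to add a single row.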
Is there any reason why this would imply the output:input token pricing ratio commonly being something like 3:1?
Got it, thanks!
Heh, I actually think it’s answered here.