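The cost of generating output tokens certainly does not scale linearly, even with a KV cache. The KV cache means you don’t need to recompute the k/v vectors for each of the previous tokens, but you still need to compute n q·k dot products for the (n+1)th token. So the attention work per token grows with its position, and the total over the whole output grows quadratically.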
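To make the counting concrete, here is a rough sketch in Python. It is only a counting model, not how any particular inference stack is implemented: the function names are made up for illustration, it counts a single attention head in a single layer, and it ignores the token's dot product with its own key as well as all non-attention work (MLPs, projections), which does scale linearly.

```python
def attention_dot_products(prompt_len: int, output_len: int) -> list[int]:
    """Per-step count of query-key dot products with a KV cache.

    Hypothetical counting model: one head, one layer, self term ignored.
    """
    # The token at position n + 1 has n cached keys to attend to.
    # The first generated token sits right after the prompt, so n = prompt_len.
    return [prompt_len + i for i in range(output_len)]


def total_attention_dot_products(prompt_len: int, output_len: int) -> int:
    """Total dot products over the whole generation (sum of the per-step counts)."""
    return sum(attention_dot_products(prompt_len, output_len))


if __name__ == "__main__":
    for m in (1000, 2000, 4000):
        print(m, total_attention_dot_products(0, m))
```

With an empty prompt the total is m(m-1)/2 for m output tokens, so doubling the output length roughly quadruples the attention dot products, even though the cache saves you from recomputing the keys and values themselves.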