During inference, for each token and each layer over it, the attention block computes some vectors, the data called the KV cache. For the current token, the contribution of an attention block in some layer to the residual stream is computed by looking at the entries in the KV cache of the same layer across all the preceding tokens. This won’t contribute to the KV cache entry for the current token at the same layer, it only influences the entry at the next layer, which is how all of this can run in parallel when processing input tokens and in training. The dataflow is shallow but wide.
So I would guess it should be possible to post-train an LLM to give answers like ”................… Yes” instead of “Because 7! contains both 3 and 5 as factors, which multiply to 15 Yes”, and the LLM would still be able to take advantage of CoT (for more challenging questions), because it would be following a line of reasoning written down in the KV cache lines in each layer across the preceding tokens, even if in the first layer there is always the same uninformative dot token. The tokens of the question are still explicitly there and kick off the process by determining the KV cache entries over the first dot tokens of the anwer, which can then be taken into account when computing the KV cache entries over the following dot tokens (moving up a layer where the dependence on the KV cache data over the preceding dot tokens is needed), and so on.
So I would guess it should be possible to post-train an LLM to give answers like ”................… Yes” instead of “Because 7! contains both 3 and 5 as factors, which multiply to 15 Yes”, and the LLM would still be able to take advantage of CoT
This doesn’t necessarily follow—on a standard transformer architecture, this will give you more parallel computation but no more serial computation than you had before. The bit where the LLM does N layers’ worth of serial thinking to say “3” and then that “3″ token can be fed back into the start of N more layers’ worth of serial computation is not something that this strategy can replicate!
That’s relevant, but about what I expected and why I hedged with “it should be possible to post-train”, which that paper doesn’t explore. Residual stream on many tokens is working memory, N layers of “vertical” compute over one token only have one activation vector to work with, while with more filler tokens you have many activation vectors that can work on multiple things in parallel and then aggregate. If a weaker model doesn’t take advantage of this, or gets too hung up on concrete tokens to think about other things in the meantime, instead of being able to maintain multiple trains of thought simultaneously, a stronger model might[1].
Performance on large questions (such as reading comprehension) with immediate answer (no CoT) shows that N layers across many tokens and no opportunity to get deeper serial compute is sufficient for many purposes. But a question is only fully understood when it’s read completely, so some of the thinking about the answer can’t start before that. If there are no more tokens, this creates an artificial constraint on working memory for thinking about the answer, filler tokens should be useful for lifting it. Repeating the question seems to help for example (see Figure 3 and Table 5).
The paper is from Jul 2023 and not from OpenAI, so it didn’t get to play with 2e25+ FLOPs models, and a new wave of 2e26+ FLOPs models is currently imminent.
During inference, for each token and each layer over it, the attention block computes some vectors, the data called the KV cache. For the current token, the contribution of an attention block in some layer to the residual stream is computed by looking at the entries in the KV cache of the same layer across all the preceding tokens. This won’t contribute to the KV cache entry for the current token at the same layer, it only influences the entry at the next layer, which is how all of this can run in parallel when processing input tokens and in training. The dataflow is shallow but wide.
So I would guess it should be possible to post-train an LLM to give answers like ”................… Yes” instead of “Because 7! contains both 3 and 5 as factors, which multiply to 15 Yes”, and the LLM would still be able to take advantage of CoT (for more challenging questions), because it would be following a line of reasoning written down in the KV cache lines in each layer across the preceding tokens, even if in the first layer there is always the same uninformative dot token. The tokens of the question are still explicitly there and kick off the process by determining the KV cache entries over the first dot tokens of the anwer, which can then be taken into account when computing the KV cache entries over the following dot tokens (moving up a layer where the dependence on the KV cache data over the preceding dot tokens is needed), and so on.
This doesn’t necessarily follow—on a standard transformer architecture, this will give you more parallel computation but no more serial computation than you had before. The bit where the LLM does N layers’ worth of serial thinking to say “3” and then that “3″ token can be fed back into the start of N more layers’ worth of serial computation is not something that this strategy can replicate!
Empirically, if you look at figure 5 in Measuring Faithfulness in Chain-of-Thought Reasoning, adding filler tokens doesn’t really seem to help models get these questions right:
That’s relevant, but about what I expected and why I hedged with “it should be possible to post-train”, which that paper doesn’t explore. Residual stream on many tokens is working memory, N layers of “vertical” compute over one token only have one activation vector to work with, while with more filler tokens you have many activation vectors that can work on multiple things in parallel and then aggregate. If a weaker model doesn’t take advantage of this, or gets too hung up on concrete tokens to think about other things in the meantime, instead of being able to maintain multiple trains of thought simultaneously, a stronger model might[1].
Performance on large questions (such as reading comprehension) with immediate answer (no CoT) shows that N layers across many tokens and no opportunity to get deeper serial compute is sufficient for many purposes. But a question is only fully understood when it’s read completely, so some of the thinking about the answer can’t start before that. If there are no more tokens, this creates an artificial constraint on working memory for thinking about the answer, filler tokens should be useful for lifting it. Repeating the question seems to help for example (see Figure 3 and Table 5).
The paper is from Jul 2023 and not from OpenAI, so it didn’t get to play with 2e25+ FLOPs models, and a new wave of 2e26+ FLOPs models is currently imminent.