Thanks for your comment, these are great questions!
I did not analyze the vectors themselves. A concrete (and easy) experiment would be to create a UMAP plot of the residual stream activations at the last position for different layers. My guess is that i) you start with one big cluster, ii) then see multiple clusters determined by the value of R, and iii) finally see multiple clusters determined by the value of R(C). I did not run this analysis because I decided to focus on causal interventions: it’s hard to know from the vectors alone which differences matter for the model’s computation. Such analyses are useful as side sanity checks though (e.g. Figure 5 of https://arxiv.org/pdf/2310.15916.pdf ).
The particular kind of corruption of C (adding a distractor) is designed not to change the content of C. The distractor is crafted to be read by the model as a request, i.e. to trigger the induction mechanism to repeat the token that comes next instead of answering the question.
Take the input X with C = “Alice, London”, R = “What is the city? The next story is in”, and distractor D = “The next story is in Paris.”*10. The distractor successfully makes the model output “Paris” instead of “London”.
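To make the construction concrete, here is a minimal sketch of how such a clean/distracted pair of inputs could be assembled. The helper name `build_prompt`, the newline separators, and the ordering (context, then distractor, then request) are assumptions for illustration; the only constraint taken from the example is that the repeated distractor appears before the final “The next story is in”.

```python
def build_prompt(context: str, request: str, distractor: str = "", n_repeats: int = 0) -> str:
    """Assemble an input from context C, an optional repeated distractor D, and request R.

    Hypothetical helper: the separator and ordering are assumptions, not the
    exact format used in the experiments.
    """
    parts = [context]
    if distractor and n_repeats > 0:
        # Repeat the distractor so the final phrase of R has many earlier matches.
        parts.append(" ".join([distractor] * n_repeats))
    parts.append(request)
    return "\n".join(parts)


C = "Alice, London"
R = "What is the city? The next story is in"
D = "The next story is in Paris."

clean_prompt = build_prompt(C, R)
distracted_prompt = build_prompt(C, R, D, n_repeats=10)

print(clean_prompt)
print("---")
print(distracted_prompt)
```

Both prompts contain the same C and end with the same R; only the presence of D differs.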
My guess is that, with the distractor, the request that gets compiled internally is “Find the token that comes after ‘The next story is in’”, whereas without the distractor it would be “Find a city in the context” or “Find the city in the previous paragraph”.
When you patch the activation from a clean run, it restores the clean request representation and overwrites the induction request.
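The logic of that patching step can be sketched with a deliberately tiny toy, which is not the actual experiment. The two hand-written functions below stand in for “early layers that compile a request” and “later layers that execute it”; every name (`compile_request`, `execute_request`, `run`) and the string-valued “activation” are assumptions made for illustration. A real run would patch residual stream activations in a transformer instead.

```python
TAIL = "The next story is in"


def compile_request(prompt: str) -> str:
    """Toy stand-in for early layers: which request does the prompt encode?"""
    # Text before the final occurrence of TAIL (the prompt is assumed to end with it).
    body = prompt[: prompt.rfind(TAIL)]
    # If the phrase already appeared earlier, the toy model "decides" to copy.
    return "induction" if TAIL in body else "find_city"


def execute_request(prompt: str, request: str) -> str:
    """Toy stand-in for later layers: act on the compiled request."""
    if request == "induction":
        # Copy the word that followed the first occurrence of TAIL.
        return prompt.split(TAIL, 1)[1].split()[0].strip(".")
    # Otherwise read the city from the "Name, City" context on the first line.
    return prompt.splitlines()[0].split(", ")[1]


def run(prompt: str, patched_request=None):
    """Run the toy model, optionally overwriting the intermediate request."""
    request = compile_request(prompt) if patched_request is None else patched_request
    return execute_request(prompt, request), request


clean = "Alice, London\nWhat is the city? The next story is in"
corrupted = (
    "Alice, London\n"
    + " ".join(["The next story is in Paris."] * 10)
    + "\nWhat is the city? The next story is in"
)

clean_out, clean_request = run(clean)          # cache the clean intermediate
corrupted_out, _ = run(corrupted)              # distractor wins
patched_out, _ = run(corrupted, clean_request)  # patch clean request into corrupted run

print(clean_out, corrupted_out, patched_out)
```

In this toy, the corrupted input still contains the distractor when patched; only the intermediate request is replaced, which is what makes the intervention causal evidence about that intermediate.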
Given the generality of the phenomenon, my guess is that the results would generalize to more complex cases. It is even possible that you can decompose how the request gets computed into more steps, e.g. i) represent the entity (“Alice”) you’re asking about (possibly using binding IDs), ii) represent the attribute you’re looking for (“origin country”), iii) retrieve the token.