Interesting ideas, and nicely explained! Some questions:
1) First, notation: request patching means replacing the vector at activation A for R2 on C2 with the vector at the same activation A for R1 on C1. Then the question: did you do any analysis on the set of vectors at A as you vary R and C? Based on your results, I expect that the vector at A is similar if you keep R the same and vary C.
2) I found the success on the toy prompt injection surprising! My intuition up to that point was that R and C are independently represented to a large extent, and you could go from computing R2(C2) to R1(C2) by patching R1 from computation of R1(C1). But the success on preventing prompt injection means that corrupting C is somehow corrupting R too, meaning that C and R are actually coupled. What is your intuition here?
3) How robust do you think the results are if you make C and R more complex? E.g. C contains multiple characters who come from various countries but live in the same city, and R is ‘Where does character Alice come from?’
Thanks for your comment, these are great questions!
I did not conduct analyses of the vectors themselves. A concrete (and easy) experiment could be to create a UMAP plot of the residual stream activations at the last position for different layers. My guess is that, as you go deeper in the layers, i) you start with one big cluster, ii) then get multiple clusters determined by the value of R, and iii) finally multiple clusters determined by the value of R(C). I did not run such an analysis because I decided to focus on causal interventions: it’s hard to know from the vectors alone which differences matter for the model’s computation. Such analyses are useful as side sanity checks though (e.g. Figure 5 of https://arxiv.org/pdf/2310.15916.pdf).
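For concreteness, here is a rough sketch of what that experiment could look like. The model, prompts, layer index, and UMAP settings are all illustrative placeholders, not the setup from the post:

```python
# Sketch of the UMAP sanity check: gather the residual stream activation at the
# last token position for a batch of (R, C) prompts at one layer, project to 2D,
# and color points by R. Model, prompts, and layer index are placeholders.
import numpy as np
import torch
import umap
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

contexts = ["Alice lives in London.", "Bob lives in Paris.",
            "Carol lives in Tokyo.", "Dan lives in Madrid."]
requests = ["What is the city? Answer:", "What is the name? Answer:"]
prompts = [f"{c} {r}" for r in requests for c in contexts]
labels = [r for r in requests for _ in contexts]       # color by the value of R

LAYER = 8                                              # repeat for several layers
acts = []
with torch.no_grad():
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        # hidden_states[LAYER] has shape (batch, seq, d_model); keep last position
        acts.append(out.hidden_states[LAYER][0, -1].float().numpy())

proj = umap.UMAP(n_neighbors=3).fit_transform(np.stack(acts))
for lab in set(labels):
    idx = [i for i, l in enumerate(labels) if l == lab]
    plt.scatter(proj[idx, 0], proj[idx, 1], label=lab)
plt.legend(); plt.title(f"Layer {LAYER}, last-position residual stream")
plt.show()
```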
The particular kind of corruption of C (adding a distractor) is designed not to change the content of C. The distractor is crafted so that the model treats it as a request, i.e. it triggers the induction mechanism to repeat the token that comes next instead of answering the question.
Take the input X with C = “Alice, London”, R = “What is the city? The next story is in”, and distractor D = “The next story is in Paris.”*10. The distractor successfully makes the model output “Paris” instead of “London”.
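Concretely, the corrupted input is assembled roughly like this (the exact template in the post may differ; this just makes the “*10” explicit):

```python
# Illustrative assembly of the corrupted input; template is an assumption.
context = "Alice, London"
request = "What is the city? The next story is in"
distractor = "The next story is in Paris. " * 10   # distractor repeated 10 times

corrupted_prompt = f"{context}\n{distractor}\n{request}"
# The model now tends to complete with " Paris" (induction) instead of " London".
```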
My guess on what’s going on: with the distractor, the request that gets compiled internally is “Find the token that comes after ‘The next story is in’ ”, instead of “Find a city in the context” or “Find the city in the previous paragraph” as it would be without the distractor.
When you patch the activation from a clean run, it restores the clean request representation and overwrites the induction request.
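As a rough illustration of that intervention, here is a minimal sketch using plain PyTorch forward hooks on a GPT-2-style HuggingFace model. The model, layer index, module path, and prompts are assumptions for the sketch, not the actual setup from the post:

```python
# Sketch of request patching on the toy prompt-injection example: record the
# last-position residual stream from a clean run, then overwrite the same
# activation during the corrupted run. Model and layer are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8  # layer whose last-position residual stream gets patched

clean = "Alice, London\nWhat is the city? The next story is in"
corrupted = ("Alice, London\n" + "The next story is in Paris. " * 10
             + "What is the city? The next story is in")

# 1) Clean run: save the activation that carries the clean request.
store = {}
def grab(module, inputs, output):
    store["act"] = output[0][:, -1, :].detach().clone()
h = model.transformer.h[LAYER].register_forward_hook(grab)
with torch.no_grad():
    model(**tok(clean, return_tensors="pt"))
h.remove()

# 2) Corrupted run: overwrite the same activation with the clean one.
def patch(module, inputs, output):
    output[0][:, -1, :] = store["act"]
    return output
h = model.transformer.h[LAYER].register_forward_hook(patch)
with torch.no_grad():
    logits = model(**tok(corrupted, return_tensors="pt")).logits
h.remove()

print(tok.decode(logits[0, -1].argmax().item()))  # ideally " London", not " Paris"
```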
Given the generality of the phenomenon, my guess is that the results would generalize to more complex cases. It is even possible that you could decompose how the request gets computed into more steps, e.g. i) represent the entity you’re asking about (“Alice”), possibly using binding IDs, ii) represent the attribute you’re looking for (“origin country”), and iii) retrieve the token.