This seems like excellent work. I’m excited to see these results; they seem to be strong evidence that “just add the ‘truth steering vector’” works, albeit with finer-grained intervention on a subset of attention heads. That’s great news.
Given my understanding of your results, I am now more optimistic about:
Flexibly retargeting LLM behavior via activation engineering/ITI without damaging capabilities
Conceptual and more specific steering vectors in LLMs (like secret-spilling vectors which get LLMs to divulge any secrets they’ve been given)
Alignment overall.
We propose Inference-Time Intervention (ITI): shifting the activations along the difference of the two distribution means during inference time; model weights are kept intact.
In the language of Steering GPT-2-XL by adding an activation vector, this is an activation addition whose steering vector is the difference between the average truthful activation and the average untruthful activation. The vector is applied to the K most truth-relevant attention heads, as judged by linear probe validation accuracy.
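For concreteness, here is a minimal sketch of how I understand the intervention. The function and variable names are mine, not the paper’s; alpha and the normalization details are simplified (I believe the paper also scales by the standard deviation of activations along the steering direction):

```python
import torch

def mass_mean_steering_vectors(truthful_acts, untruthful_acts):
    """Per-head steering vectors: the difference of the two distribution means.

    truthful_acts, untruthful_acts: (n_samples, n_heads, d_head) tensors of
    attention-head activations collected on truthful vs. untruthful statements.
    Returns an (n_heads, d_head) tensor of directions.
    """
    return truthful_acts.mean(dim=0) - untruthful_acts.mean(dim=0)


def apply_iti(head_outputs, steering_vectors, probe_accuracies, k=48, alpha=15.0):
    """Shift the outputs of the k most 'truth-relevant' heads during a forward pass.

    head_outputs: (batch, seq, n_heads, d_head) attention-head outputs at inference
    probe_accuracies: (n_heads,) validation accuracy of a linear probe per head
    k, alpha: number of heads to intervene on and intervention strength
    """
    top_heads = torch.topk(probe_accuracies, k).indices
    for h in top_heads:
        direction = steering_vectors[h] / steering_vectors[h].norm()
        head_outputs[:, :, h, :] = head_outputs[:, :, h, :] + alpha * direction
    return head_outputs
```

The key property is that the model’s weights are untouched; the shift is just added to activations on every forward pass during generation.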
It’s quite surprising that the “mass mean shift”[1] outperforms the probe direction so strongly! This shows that the directions which the model uses to generate true or false statements are very different from the directions which get found by probing. Further evidence that probe directions are often not very causally relevant for the LLM’s outputs.
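To make the contrast concrete, here is a rough sketch of the two ways of extracting a direction from the same labeled activations (sklearn’s LogisticRegression stands in for whatever probe is actually trained; the function names are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_direction(acts, labels):
    """Direction found by a supervised linear probe: its normalized weight vector.

    acts: (n_samples, d_head) activations for one head
    labels: (n_samples,) numpy array, 1 = truthful, 0 = untruthful
    """
    w = LogisticRegression(max_iter=1000).fit(acts, labels).coef_[0]
    return w / np.linalg.norm(w)

def mass_mean_direction(acts, labels):
    """'Mass mean shift' direction: the difference between the two class means."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)
```

These two unit vectors can point in noticeably different directions, and the paper’s result is that adding the mean-difference one, rather than the probe’s weight vector, is what most effectively moves generations toward truthfulness.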
The transfer shown in table 4 seems decent and I’m glad it’s there, but it would have been nice if the mass mean shift vector had transferred even more strongly. Seems like one can get a substantial general boost in truthfulness with basic activation engineering, but not all the way without additional insights.
We propose Inference-Time Intervention (ITI)
The “mass mean shift” technique seems like independent development of the “activation addition” technique from Understanding and controlling a maze-solving policy network and Steering GPT-2-XL by adding an activation vector (albeit with some differences, like restricting modification to top K heads). There’s a question of “what should we call the technique?”. Are “activation engineering” and “ITI” referring to the same set of interventions?
It seems like the answer is “no”, since you use “ITI” to refer to “adding in an activation vector”, which seems better described as “activation addition.” A few considerations:
“ITI” has to be unpacked since it’s an acronym.
“Inference-time intervention” is generic and could also describe zero-ablating the outputs of given heads, and so it seems strange to use “ITI” to refer only to “adding in an activation vector.”
“Activation addition” is more specific.
“Inference-time intervention” is more descriptive of how the technique works.
Open to your thoughts here.
However, what is completely missing from LLMs is a good target other than minimizing pretraining loss. How to endow an aligned target is an open problem and ITI serves as my initial exploration towards this end.
Personally, I think that ITI is actually far more promising than the “how to endow an aligned target” question.
In figure 5 in the paper, “Indexical error: Time” appears twice as an x-axis tick label?
Previous work has shown that ‘steering’ vectors—both trained and hand-selected—can be used for style transfer in language models (Subramani et al., 2022; Turner et al., 2023).
I think it’d be more accurate to describe “steering vectors” as “can be used to control the style and content of language model generations”?
[1] steering vector = “avg difference between Truthful and Untruthful activations on the top K=48 heads”