Introducing SARA: a new activation steering technique
Disclaimer
I am currently a Postdoctoral Fellow in Computational Neuroscience, learning about Mechanistic Interpretability and AI Safety in general. This post and the paper that accompanies it are part of my ongoing pivot towards these topics, so I apologise in advance if I am not using the appropriate terminology or if I have overlooked major relevant contributions that might be useful for this work. Any constructive feedback or pointers would be sincerely appreciated!
Executive summary
This post introduces SARA (Similarity-based Activation Steering with Repulsion and Attraction), a tool that I designed to provide precise control over the moral reasoning[1] of Large Language Models (LLMs). In case you are interested, I have applied SARA to Google’s Gemma-2B in this pre-print. Therein, I also make use of ethical dilemmas, to measure the alignment of different LLMs with different ethical schools of thought, and of the Moral Foundations Questionnaire, an instrument developed in moral psychology to characterise moral profiles across cultures and demographics.
Introduction
In the context of Mechanistic Interpretability, activation steering is a technique that I, coming from Neuroscience, found particularly interesting. The idea is to modify the neural activations of an LLM in a targeted way so that its response changes as desired. One of the simplest and most straightforward such manipulations is Activation Addition (ActAdd), introduced here. To keep this post self-contained, I will paraphrase their post and briefly explain how ActAdd works:
Start with a prompt that will be steered ($p_0$).
Take a pair of prompts: one with a property that will be emphasised ($p_+$) and one with its opposite ($p_-$).
If $h^{(l)}_+$ is the activation vector for the prompt $p_+$ at layer $l$, then the difference $h^{(l)}_+ - h^{(l)}_-$ is a new activation vector which (intuitively) captures the difference between a prompt with the property and one without it.
To obtain a steering vector, perform a forward pass on each prompt, record the activations at the given layer in each pass, take the difference $h^{(l)}_+ - h^{(l)}_-$, and finally rescale this difference in activations by an ‘injection coefficient’ $c$.
To steer, add the resulting activation vector to the input of layer $l$ and allow the forward pass to continue, and so obtain the steered output.
Thus, mathematically:
$h^{(l)}_0 \leftarrow h^{(l)}_0 + c\,\big(h^{(l)}_+ - h^{(l)}_-\big)$,
where $h^{(l)}_0$ denotes the activations of the steered prompt $p_0$ at layer $l$.
I believe that, while ActAdd is a simple and scalable way of steering activations, it is limited in that it does not factor in how similar (or dissimilar) these activations were to the target (repelled) vector to begin with. The method simply shifts all activations homogeneously, possibly overshooting in some cases and falling short in others.
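To make the mechanics concrete, here is a minimal sketch of ActAdd-style steering using a forward hook in PyTorch and Hugging Face transformers. It is an illustrative reimplementation, not the original ActAdd code: the model name, layer index, injection coefficient, contrast prompts, and the simplification of using last-token activations are all assumptions made for this example.

```python
# Minimal ActAdd-style steering sketch (illustrative; not the original ActAdd implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2b"          # assumption: any decoder-only HF model with .model.layers
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx, coeff = 6, 4.0               # illustrative layer and injection coefficient

def last_token_acts(prompt: str) -> torch.Tensor:
    """Hidden state at `layer_idx` for the last token of `prompt` (a simplification of ActAdd)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1]          # shape: (d_model,)

# Steering vector: rescaled difference between the 'emphasised' and 'opposite' prompts.
steer_vec = coeff * (last_token_acts("Act only from moral duty.") -
                     last_token_acts("Act only to maximise overall happiness."))

def add_steering(module, inputs, output):
    # Adds the vector to the layer's output (the residual stream) at every position, for simplicity.
    if isinstance(output, tuple):
        return (output[0] + steer_vec,) + output[1:]
    return output + steer_vec

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("Should they report their parent to the police?", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=60)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()                      # always detach the hook afterwards
```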
How SARA works
In this work, instead of focusing on a single activation vector for $p_0$, I propose to adjust an entire activation matrix ($A_{\text{original}}$), corresponding to the model’s response to the whole prompt, by enhancing or suppressing specific activation patterns captured in two other response matrices ($A_{\text{attract}}$ and $A_{\text{repel}}$), obtained from two different prompts. These prompts fulfil the same role as in ActAdd, but they can be longer and richer than in that method.
Concretely, SARA works as follows:
We start with the activations of neurons over a sequence of tokens (the prompts need not have the same length): $A_{\text{original}}$, $A_{\text{attract}}$, and $A_{\text{repel}}$, each of size $(n_{\text{neurons}}, n^{i}_{\text{tokens}})$, with $i \in \{\text{original}, \text{attract}, \text{repel}\}$.
To align the dimensions of the activation matrices and make them comparable, compute the Singular Value Decomposition (SVD) of each activation matrix and reduce it to a common number of components (I selected $n_{\text{comp}} = \min_i n^{i}_{\text{tokens}}$). Specifically, for each activation matrix $A_i$: $A_i = U_i \Sigma_i V_i^{T}$.
Retain only the top $n_{\text{comp}}$ components to form the reduced matrices $A^{r}_{i} = U^{(:,\,n_{\text{comp}})}_{i}\,\Sigma^{(n_{\text{comp}})}_{i}$, where $U^{(:,\,n_{\text{comp}})}_{i}$ contains the first $n_{\text{comp}}$ columns of $U_i$ and $\Sigma^{(n_{\text{comp}})}_{i}$ is the top-left $n_{\text{comp}} \times n_{\text{comp}}$ submatrix of $\Sigma_i$.
Compute the cosine similarity between the reduced to-be-steered matrix ($A^{r}_{\text{original}}$) and both $A^{r}_{\text{attract}}$ (for alignment) and $A^{r}_{\text{repel}}$ (for repulsion). Cosine similarity measures how similar the activation patterns of the reduced matrices are: $\vec{s}_{\beta} = \dfrac{A^{r}_{\text{original}} \cdot A^{r}_{\beta}}{\lVert A^{r}_{\text{original}} \rVert\,\lVert A^{r}_{\beta} \rVert}$, where $\vec{s}_{\beta} \equiv \mathrm{sim}(A^{r}_{\text{original}}, A^{r}_{\beta})$ and $A^{r}_{\beta}$, $\beta \in \{\text{attract}, \text{repel}\}$, are the reduced matrices compared with $A^{r}_{\text{original}}$. The products and norms are taken row-wise, so $\vec{s}_{\beta}$ contains one similarity value per neuron.
Compute the rescaling factors by subtracting those similarities: $\vec{\lambda} = \vec{s}_{\text{attract}} - \vec{s}_{\text{repel}}$. These scaling factors determine how strongly each neuron’s activations are adjusted. The idea here is similar to that of ActAdd: we are enhancing one set of features and inhibiting another.
Rescale the activations in $A_{\text{original}}$ using this factor: $A^{\text{steered}}_{\text{original}} = A^{T}_{\text{original}} \odot (\mathbf{1} + \vec{\lambda})^{T}$, where $\mathbf{1}$ is a vector of ones and $\odot$ denotes the element-wise product (broadcast over tokens).
Thus, after this process, we end up with a method that steers LLM activations in a neuron-specific manner, depending on how similarly each particular neuron responds to the different relevant prompts. This is of particular relevance for contexts in which token dependencies (as captured by the SVD) are important for finding a general direction in which to manipulate neural activations. One such context is aligning the model’s moral compass with desired ethical frameworks without altering the model’s final response, as I will show next. Needless to say, as this is a steering technique, there is no need to modify the model architecture or to provide extra data to fine-tune model responses.
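For readers who prefer code, here is a minimal NumPy sketch of the steps above. It follows my description rather than reproducing the exact code behind the pre-print; the row-wise (per-neuron) cosine similarity, the variable names, and the toy dimensions are assumptions made for this example.

```python
# Minimal NumPy sketch of the SARA steps described above (illustrative, not the pre-print's code).
import numpy as np

def reduce_svd(A: np.ndarray, n_comp: int) -> np.ndarray:
    """Project an (n_neurons, n_tokens) activation matrix onto its top n_comp SVD components."""
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :n_comp] * S[:n_comp]                    # (n_neurons, n_comp)

def row_cosine(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Cosine similarity between corresponding rows, i.e. one value per neuron."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1) + 1e-8
    return num / den

def sara_steer(A_orig: np.ndarray, A_attr: np.ndarray, A_rep: np.ndarray) -> np.ndarray:
    """Rescale each neuron's activations by 1 + (similarity to attract - similarity to repel)."""
    n_comp = min(A_orig.shape[1], A_attr.shape[1], A_rep.shape[1])
    R_orig, R_attr, R_rep = (reduce_svd(A, n_comp) for A in (A_orig, A_attr, A_rep))
    lam = row_cosine(R_orig, R_attr) - row_cosine(R_orig, R_rep)   # lambda, shape (n_neurons,)
    return A_orig * (1.0 + lam)[:, None]                 # per-neuron factor broadcast over tokens

# Toy usage: random activations for three prompts with different token counts.
rng = np.random.default_rng(0)
A_steered = sara_steer(rng.normal(size=(2048, 100)),     # original prompt
                       rng.normal(size=(2048, 10)),      # attract prompt
                       rng.normal(size=(2048, 12)))      # repel prompt
print(A_steered.shape)                                   # (2048, 100)
```

In an actual run, the three input matrices would be the hidden states recorded at a chosen layer for the original, attract, and repel prompts, and the steered matrix would then be fed back into the forward pass at that layer.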
Results
As an illustrative example:
Question: An individual discovers that one of their parents has committed a crime. Should they report or protect them?
Original Response: The individual should report the crime, as justice must be administered. Protection of family members cannot justify the crime.
Kantian Steering: Reporting to the authorities is generally the more responsible course of action. Arguments in favor: Duty of care: Individuals have a duty to report any potential criminal activity that could harm others.
Utilitarian Steering: Reporting to the authorities is generally the morally responsible course of action. Arguments in favor: Protection of Human Rights: Reporting exposes the individual to potential legal consequences and ensures accountability for their actions.
The emphasis is mine, as I believe this is a particularly good example of what SARA is useful for: changing a model’s reasoning without really modifying its final conclusion. In this case, all responses agree that the criminal parent should be reported, but the steered ones support that conclusion with arguments rooted in different philosophical principles (moral duties versus consequences).
To test SARA more quantitatively, I steered model responses multiple times, pooled them, and computed how many of them were classified as belonging to the different ethical schools (more details in the pre-print, which also inspects the effect of steering at different layers). As a useful comparison, I did the same with ActAdd, using the exact same prompts. Here are the results:
The main difference between SARA and ActAdd is how effective the Utilitarian steering is at modifying those responses belonging to a priori values (compare the two blue bars within that category). The same effect is seen when applying the Kantian steering to the utilitarianism responses (the purple bars therein). In other words, SARA makes within-category steering (a priori values using Kantian steering, utilitarianism using Utilitarian steering) more likely (purple bars within a priori values and blue bars within utilitarianism). Moreover, while SARA does a good job of steering responses, it also leads to less unwanted steering towards non-target responses (for example, a lower ratio of a priori values responses when using the Kantian steering).
I believe this set of results can be partially explained by SARA allowing more complex prompts, and by token dependencies also playing a role in determining how similar or different model activations are in a more high-level (conceptual) sense.
Conclusions
I believe that SARA’s main added value comes from several key points: 1) it is designed to operate at the prompt level, thereby lowering the technical threshold needed to implement it; 2) it operates in the high-dimensional activation space, retaining much more richness than summary metrics; 3) it can also be thought of as an automated moderator, given that no human supervision is involved in the process; 4) there is no need for prompt engineering to safeguard model responses; and 5) there is no formal constraint that the prompts (the one to steer towards and the one to steer away from) have the same length. That said, I expect better steering performance when the prompts are of reasonably similar size, due to how the SVD works; even so, in this particular case the prompt lengths differed by an order of magnitude ($n^{\text{original}}_{\text{tokens}} \approx 100$, $n^{\text{attract}}_{\text{tokens}} \approx n^{\text{repel}}_{\text{tokens}} \approx 10$).
I suggest that activation steering and similar intervention techniques, beyond helping us understand how models process information, can potentially be used to fine-tune or safeguard foundation models without retraining. Specifically, I envision this as an extra safety layer that could be added right before the deployment stage, to further ensure that the model complies with expected behaviour. This would be of particular interest for actors with reduced access to computing power or technical resources who want to deploy pre-trained LLMs. Also, the lack of retraining or fine-tuning implies a smaller need for computational (and thus energy) resources to achieve the safeguarding.
Finally, I believe it is crucial that the AI Safety field starts pivoting towards a paradigm of richer performance characterisations, rather than optimising models for particular benchmarks, which carries its own risks (see this other LessWrong post for more details). In the pre-print, I offer hints on how one might transition into such a paradigm, benefiting from the rich existing literature in other fields and embracing a mixture of quantitative and qualitative analyses.
[1] Although I will keep talking about using SARA in the ethical context, in principle it can handle arbitrary conceptual directions, by construction.