[Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition
Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.
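For concreteness, here is a minimal sketch of what such a linear intervention typically looks like; the names (`linear_steer`, `steering_vector`, `alpha`) are illustrative rather than taken from any particular codebase:

```python
# Minimal sketch of a standard (linear) steering intervention: add a scaled
# feature direction to the residual-stream activations. Names are illustrative.
import torch

def linear_steer(activations: torch.Tensor,
                 steering_vector: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """Add a scaled feature direction to each activation vector.

    activations:     (batch, d_model) residual-stream activations
    steering_vector: (d_model,) direction encoding the feature
    alpha:           strength of the intervention
    """
    direction = steering_vector / steering_vector.norm()
    return activations + alpha * direction
```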
Can we extend this to nonlinear features? A simple example of a nonlinear feature is the circular representations found in modular arithmetic. Here, it’s clear that a simple “steering vector” will not work. Nonetheless, as the authors of that paper show, it’s possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result.
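To see why a fixed vector fails, here is a toy sketch (not the construction from the paper) which assumes the residue $a \bmod p$ is embedded in a known 2-dimensional subspace as $(\cos(2\pi a/p), \sin(2\pi a/p))$. The edit needed to steer from $y$ to $y'$ is then a rotation whose angle depends on both labels, so no single additive vector can implement it:

```python
# Toy illustration of a circular feature and a rotation-based steering edit.
# The 2-D embedding below is an assumption, standing in for whatever basis the
# real model uses for its circular representation.
import numpy as np

p = 7  # modulus, e.g. days of the week

def circle_embed(a: int) -> np.ndarray:
    theta = 2 * np.pi * a / p
    return np.array([np.cos(theta), np.sin(theta)])

def rotate_on_circle(v: np.ndarray, y: int, y_target: int) -> np.ndarray:
    """Rotate the 2-D circular feature from residue y to residue y_target."""
    dtheta = 2 * np.pi * (y_target - y) / p
    R = np.array([[np.cos(dtheta), -np.sin(dtheta)],
                  [np.sin(dtheta),  np.cos(dtheta)]])
    return R @ v

# The additive edit rotate_on_circle(v, y, y') - v depends on the starting
# residue y, so a single fixed steering vector cannot implement it.
v = circle_embed(2)
v_steered = rotate_on_circle(v, 2, 5)
```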
Problem: The construction of the steering intervention in the modular addition paper relies heavily on a priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn’t need to fully elucidate this geometry in order for steering to be effective.
Therefore, we want a procedure that learns a nonlinear steering intervention given only the model’s activations and labels (e.g., the correct next token).
Such a procedure might look something like this (a rough code sketch follows the list):
1. Assume we have paired data $(x, y)$ for a given concept, where $x$ is the model’s activations and $y$ is the label, e.g. the day of the week.
2. Define a function $x' = f_\theta(x, y, y')$ that predicts the steered activation $x'$ for moving the model towards the target label $y'$.
3. Optimize $f_\theta(x, y, y')$ on a dataset of steering examples.
4. Evaluate the model under this steering intervention and check whether we have actually steered it towards $y'$, comparing against the ground-truth steering intervention.
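To make this concrete, one possible implementation is sketched below in PyTorch. Everything in it is an assumption rather than a settled design: the MLP architecture for $f_\theta$, the use of counterfactual activations (activations from examples whose label is already $y'$) as regression targets, and the MSE loss.

```python
# Sketch of one way to learn f_theta(x, y, y') -> x' from steering examples.
# The architecture, targets, and loss are assumptions for illustration only.
import torch
import torch.nn as nn

class SteeringMap(nn.Module):
    """f_theta(x, y, y') -> x': a learned, possibly nonlinear steering edit."""
    def __init__(self, d_model: int, n_labels: int, d_hidden: int = 256):
        super().__init__()
        self.n_labels = n_labels
        self.net = nn.Sequential(
            nn.Linear(d_model + 2 * n_labels, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x, y, y_prime):
        y_oh = nn.functional.one_hot(y, self.n_labels).float()
        yp_oh = nn.functional.one_hot(y_prime, self.n_labels).float()
        return self.net(torch.cat([x, y_oh, yp_oh], dim=-1))

def train_steering_map(f_theta, dataset, n_epochs=10, lr=1e-3):
    """dataset yields (x, y, y_prime, x_target) tuples, where x_target is an
    activation drawn from a real example whose label is already y_prime."""
    opt = torch.optim.Adam(f_theta.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, y, y_prime, x_target in dataset:
            x_steered = f_theta(x, y, y_prime)
            loss = nn.functional.mse_loss(x_steered, x_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f_theta
```

The real test is then step 4: patch $x'$ back into the model and measure how often the prediction moves to $y'$.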
If this works, it might be applicable to other examples of nonlinear feature geometries as well.
Thanks to David Chanin for useful discussions.
You might be interested in works like Kernelized Concept Erasure, Representation Surgery: Theory and Practice of Affine Steering, and Identifying Linear Relational Concepts in Large Language Models.
This is really interesting, thanks! As I understand it, “affine steering” applies an affine map to the activations, and this is expressive enough to perform a “rotation” on the circle. David Chanin has told me before that LRC doesn’t really work for steering vectors. I haven’t grokked kernelized concept erasure yet, but I’ll give it another read.
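As a quick sanity check on that point: a rotation of the circular subspace is a special case of an affine map $x \mapsto Wx + b$, so an affine steering function should in principle be expressive enough. Here is a tiny numerical sketch, using a hypothetical basis where the circle lives in the first two dimensions:

```python
# A rotation of the circular subspace expressed as an affine map x -> W x + b.
# The basis (circle in the first two dims) is hypothetical. Note that a single
# fixed W shifts every residue by the same offset; steering to an arbitrary
# absolute target would need W to depend on (y, y').
import numpy as np

d_model, p = 8, 7
dtheta = 2 * np.pi * (5 - 2) / p  # e.g. shift residue 2 -> residue 5

W = np.eye(d_model)
W[:2, :2] = [[np.cos(dtheta), -np.sin(dtheta)],
             [np.sin(dtheta),  np.cos(dtheta)]]  # rotate the first two dims
b = np.zeros(d_model)

x = np.random.randn(d_model)
x_steered = W @ x + b
```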
Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition.