[Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition
Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.
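For concreteness, here is a minimal sketch of what such a linear intervention typically looks like; the names (`linear_steer`, `steering_vector`, `alpha`) are illustrative rather than taken from any particular codebase:

```python
# Minimal sketch of a standard (linear) steering intervention: add a scaled
# feature direction to the residual-stream activations. Names are illustrative.
import torch

def linear_steer(activations: torch.Tensor,
                 steering_vector: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """Add a scaled feature direction to each activation vector.

    activations:     (batch, d_model) residual-stream activations
    steering_vector: (d_model,) direction encoding the feature
    alpha:           strength of the intervention
    """
    direction = steering_vector / steering_vector.norm()
    return activations + alpha * direction
```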
Can we extend this to nonlinear features? A simple example of a nonlinear feature is the circular representations found in modular arithmetic. Here, it’s clear that a simple “steering vector” will not work. Nonetheless, as the authors of that paper show, it’s possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result.
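To see why a fixed vector fails, here is a toy sketch (not the construction from the paper) which assumes the residue $a \bmod p$ is embedded in a known 2-dimensional subspace as $(\cos(2\pi a/p), \sin(2\pi a/p))$. The edit needed to steer from $y$ to $y'$ is then a rotation whose angle depends on both labels, so no single additive vector can implement it:

```python
# Toy illustration of a circular feature and a rotation-based steering edit.
# The 2-D embedding below is an assumption, standing in for whatever basis the
# real model uses for its circular representation.
import numpy as np

p = 7  # modulus, e.g. days of the week

def circle_embed(a: int) -> np.ndarray:
    theta = 2 * np.pi * a / p
    return np.array([np.cos(theta), np.sin(theta)])

def rotate_on_circle(v: np.ndarray, y: int, y_target: int) -> np.ndarray:
    """Rotate the 2-D circular feature from residue y to residue y_target."""
    dtheta = 2 * np.pi * (y_target - y) / p
    R = np.array([[np.cos(dtheta), -np.sin(dtheta)],
                  [np.sin(dtheta),  np.cos(dtheta)]])
    return R @ v

# The additive edit rotate_on_circle(v, y, y') - v depends on the starting
# residue y, so a single fixed steering vector cannot implement it.
v = circle_embed(2)
v_steered = rotate_on_circle(v, 2, 5)
```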
Problem: The construction of the steering intervention in the modular addition paper relies heavily on a priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn’t need to fully elucidate this geometry in order for steering to be effective.
Therefore, we want a procedure that learns a nonlinear steering intervention given only the model’s activations and labels (e.g., the correct next token).
Such a procedure might look something like this (a rough code sketch follows the list):
1. Assume we have paired data $(x, y)$ for a given concept, where $x$ is the model’s activations and $y$ is the label, e.g. the day of the week.
2. Define a function $x' = f_\theta(x, y, y')$ that predicts the steered activation $x'$ for moving the model towards the target label $y'$.
3. Optimize $f_\theta(x, y, y')$ on a dataset of steering examples.
4. Evaluate the model under this steering intervention and check whether we have actually steered it towards $y'$, comparing against the ground-truth steering intervention.
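To make this concrete, one possible implementation is sketched below in PyTorch. Everything in it is an assumption rather than a settled design: the MLP architecture for $f_\theta$, the use of counterfactual activations (activations from examples whose label is already $y'$) as regression targets, and the MSE loss.

```python
# Sketch of one way to learn f_theta(x, y, y') -> x' from steering examples.
# The architecture, targets, and loss are assumptions for illustration only.
import torch
import torch.nn as nn

class SteeringMap(nn.Module):
    """f_theta(x, y, y') -> x': a learned, possibly nonlinear steering edit."""
    def __init__(self, d_model: int, n_labels: int, d_hidden: int = 256):
        super().__init__()
        self.n_labels = n_labels
        self.net = nn.Sequential(
            nn.Linear(d_model + 2 * n_labels, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x, y, y_prime):
        y_oh = nn.functional.one_hot(y, self.n_labels).float()
        yp_oh = nn.functional.one_hot(y_prime, self.n_labels).float()
        return self.net(torch.cat([x, y_oh, yp_oh], dim=-1))

def train_steering_map(f_theta, dataset, n_epochs=10, lr=1e-3):
    """dataset yields (x, y, y_prime, x_target) tuples, where x_target is an
    activation drawn from a real example whose label is already y_prime."""
    opt = torch.optim.Adam(f_theta.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, y, y_prime, x_target in dataset:
            x_steered = f_theta(x, y, y_prime)
            loss = nn.functional.mse_loss(x_steered, x_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f_theta
```

The real test is then step 4: patch $x'$ back into the model and measure how often the prediction moves to $y'$.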
If this works, it might be applicable to other examples of nonlinear feature geometries as well.
Thanks to David Chanin for useful discussions.
You might be interested in works like Kernelized Concept Erasure, Representation Surgery: Theory and Practice of Affine Steering, and Identifying Linear Relational Concepts in Large Language Models.
This is really interesting, thanks! As I understand it, “affine steering” applies an affine map to the activations, and this is expressive enough to perform a “rotation” on the circle. David Chanin has told me before that LRC doesn’t really work for steering vectors. I haven’t grokked kernelized concept erasure yet, but I’ll give it another read.
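As a quick sanity check on that point: a rotation of the circular subspace is a special case of an affine map $x \mapsto Wx + b$, so an affine steering function should in principle be expressive enough. Here is a tiny numerical sketch, using a hypothetical basis where the circle lives in the first two dimensions:

```python
# A rotation of the circular subspace expressed as an affine map x -> W x + b.
# The basis (circle in the first two dims) is hypothetical. Note that a single
# fixed W shifts every residue by the same offset; steering to an arbitrary
# absolute target would need W to depend on (y, y').
import numpy as np

d_model, p = 8, 7
dtheta = 2 * np.pi * (5 - 2) / p  # e.g. shift residue 2 -> residue 5

W = np.eye(d_model)
W[:2, :2] = [[np.cos(dtheta), -np.sin(dtheta)],
             [np.sin(dtheta),  np.cos(dtheta)]]  # rotate the first two dims
b = np.zeros(d_model)

x = np.random.randn(d_model)
x_steered = W @ x + b
```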
Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition.