Edit 6/20/24: The authors updated the paper; see my comment.
To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating examples for probe directions, using the same pipeline we use for our features, and (2) using these probe directions for steering.
These vectors are not “linear probes” (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples); they are difference-in-means of activation vectors.
So call them “steering vectors”!
As a side note, actual linear probe directions tend not to steer models very well (see e.g. Inference Time Intervention, Table 3 on page 8).
In my experience, steering vectors generally require averaging over at least 32 contrast pairs. Anthropic only compares to 1-3 contrast pairs, which is inappropriate.
Since feature clamping needs fewer prompts for some tasks, that is a real benefit, but you have to amortize that benefit over the huge SAE effort needed to find those features.
Also note that you can generate synthetic data for the steering vectors using an LLM; it isn’t too hard.
For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true).
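To make the difference-in-means construction concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the activations are synthetic Gaussians with a planted `concept` direction rather than real residual-stream activations, and `steering_vector`, `steer`, and `fake_pairs` are hypothetical helper names. It only shows how averaging over more contrast pairs reduces noise in the estimated direction; it says nothing about how many pairs a real model needs.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means: average (positive - negative) over contrast pairs."""
    assert pos_acts.shape == neg_acts.shape
    return (pos_acts - neg_acts).mean(axis=0)

def steer(resid: np.ndarray, vec: np.ndarray, coeff: float) -> np.ndarray:
    """Add a scaled steering vector to a (toy) residual-stream activation."""
    return resid + coeff * vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d_model = 64  # assumed toy width, not a real model dimension

# Planted unit-norm "concept" direction that the contrast pairs differ along.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)

def fake_pairs(n_pairs: int):
    """Synthetic contrast pairs: +/- concept plus independent unit Gaussian noise."""
    pos = rng.normal(size=(n_pairs, d_model)) + concept
    neg = rng.normal(size=(n_pairs, d_model)) - concept
    return pos, neg

# Compare how well the estimate recovers the planted direction.
for n_pairs in (2, 32, 256):
    vec = steering_vector(*fake_pairs(n_pairs))
    print(n_pairs, round(cosine(vec, concept), 3))
```

In this toy setup the noise in the estimate shrinks roughly as one over the square root of the number of pairs, which is the intuition behind not comparing against a vector built from only 1-3 pairs.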
In all cases, we were unable to interpret the probe directions from their activating examples. In most cases (with a few exceptions) we were unable to adjust the model’s behavior in the expected way by adding perturbations along the probe directions, even in cases where feature steering was successful (see this appendix for more details).
...
We note that these negative results do not imply that linear probes are not useful in general. Rather, they suggest that, in the “few-shot” prompting regime, they are less interpretable and effective for model steering than dictionary learning features.
I fully expect feature clamping to still win out in a bunch of comparisons (it’s really cool), but Anthropic’s actual comparisons don’t seem good, and they predictably underrate steering vectors.
The fact that the Anthropic paper gets the comparison (and especially terminology) meaningfully wrong makes me more wary of their results going forward.
The authors updated the Scaling Monosemanticity paper. Relevant updates include:
1. In the intro, they added:
Features can be used to steer large models (see e.g. Influence on Behavior). This extends prior work on steering models using other methods (see Related Work).
2. The related work section now credits the rich history behind steering vectors / activation engineering, including not just my team’s work on activation additions, but also older literature in VAEs and GANs. (EDIT: Apparently this was always there? Maybe I misremembered the diff.)
3. The comparison results are now in an appendix and are much more hedged, noting they didn’t evaluate properly according to a steering vector baseline.
While it would have been better to have done this the first time, I really appreciate the team updating the paper to more clearly credit past work. :)
The Scaling Monosemanticity paper doesn’t do a good job comparing feature clamping to steering vectors.
Oh, that’s great! Kudos to the authors for setting the record straight. I’m glad your work is now appropriately credited.
[low importance]
It would be hard for the steering vectors not to win given that the method as described involves spending a comparable amount of compute to training the model in the first place (from my understanding) and more if you want to get “all of the features”.
(Not trying to push back on your comment in general or disagreeing with this line, just noting that the gap is large enough that the number of steering vector pairs hardly matters if you just steer on a single task.)
I think there’s something more general to the argument (related to SAEs seeming somewhat overkill in many ways, for strictly safety purposes).
For SAEs, the computational complexity would likely be on the same order as full pretraining; e.g. from Mapping the Mind of a Large Language Model:
‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).’
While for activation steering approaches, the computational complexity should probably be similar to this ‘Computational requirements’ section from Language models can explain neurons in language models:
‘Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^(2/3)), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^(5/3)).
On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).’
Yeah, if you use constant compute to explain each “feature” and features are proportional to model scale, this is only O(n^2) which is the same as training compute.
However, it seems plausible to me that you actually need to look at interactions between features and so you end up with O(log(n) n^2) or even O(n^3).
Also constant factors can easily destroy you here.
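As a rough illustration of the exponents discussed above, the sketch below compares each asymptotic cost to an O(n^2) pretraining baseline. It deliberately sets every constant factor to 1, which is an assumption (and, as noted, constants can dominate in practice); `cost_ratios` is a hypothetical helper name, and the parameter counts are arbitrary examples.

```python
import math

def cost_ratios(n_params: float) -> dict:
    """Asymptotic cost relative to an n^2 pretraining baseline, constants ignored."""
    baseline = n_params ** 2
    return {
        # Constant forward passes per activation, O(n^(2/3)) activations.
        "n^(5/3)": n_params ** (5 / 3) / baseline,
        # Constant compute per feature, features proportional to n.
        "n^2": 1.0,
        # Feature interactions, as speculated in the reply.
        "n^2 log n": math.log(n_params),
        "n^3": n_params,
    }

for n in (1e8, 1e9, 1e10):
    ratios = cost_ratios(n)
    print(f"n={n:.0e}", {k: f"{v:.3g}" for k, v in ratios.items()})
```

The n^(5/3) case is cheaper than the baseline by a factor of n^(1/3), while the interaction-heavy cases are more expensive by a factor of log n or n.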
I think DIM (difference-in-means) and LR (logistic regression) aren’t spiritually different (e.g. in the limit of infinite L2 regularization, LR gives you the same direction as DIM), even though in practice DIM is better for steering (and ablations). But I agree with you that “steering vectors” is the right term for directions used for steering (while I would use “linear probes” for directions used to extract information, or trained to extract information and then used for another purpose).
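The DIM/LR relationship can be checked numerically. This is a minimal sketch under stated assumptions: synthetic, class-balanced Gaussian data with anisotropic noise, a logistic regression with no bias term, and plain gradient descent written in NumPy. The names `dim_direction` and `l2_logreg_direction` are hypothetical. With a large L2 penalty the weights stay near zero, where the loss gradient is proportional to the difference of class means, so the two directions should nearly coincide.

```python
import numpy as np

def dim_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Difference-in-means direction."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def l2_logreg_direction(pos, neg, l2: float, steps: int = 500) -> np.ndarray:
    """Approximately minimize mean log-loss + (l2/2)*||w||^2 by gradient descent.

    No bias term (an assumption made to keep the sketch short).
    """
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    n = len(X)
    w = np.zeros(X.shape[1])
    # Step size from a smoothness bound: l2 + sigma_max(X)^2 / (4n).
    step = 1.0 / (l2 + np.linalg.norm(X, 2) ** 2 / (4 * n))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= step * (X.T @ (p - y) / n + l2 * w)
    return w

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, n_per_class = 16, 200
scales = np.linspace(0.5, 3.0, d)  # anisotropic noise so LR and DIM can differ
shift = rng.normal(size=d)
pos = rng.normal(size=(n_per_class, d)) * scales + shift
neg = rng.normal(size=(n_per_class, d)) * scales - shift

dim = dim_direction(pos, neg)
for l2 in (1e-2, 1.0, 1e3):
    w = l2_logreg_direction(pos, neg, l2)
    print(l2, round(cosine(w, dim), 4))
```

The weakly regularized runs are only approximately converged after 500 steps, so treat their cosines as indicative; the claim being illustrated is the strongly regularized limit.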