For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true).
[low importance]
It would be hard for the steering vectors not to win, given that the method as described involves spending an amount of compute comparable to training the model in the first place (as I understand it), and more if you want to get “all of the features”.
(I’m not trying to push back on your comment in general or disagree with this line, just noting that the gap is so large that the number of steering vector pairs hardly matters if you only steer on a single task.)
I think there’s something more general to the argument (related to SAEs seeming somewhat overkill in many ways, for strictly safety purposes).
For SAEs, the computational complexity would likely be on the same order as full pretraining; e.g. from Mapping the Mind of a Large Language Model:
‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).’
For activation steering approaches, by contrast, the computational complexity should probably be similar to the ‘Computational requirements’ section of Language models can explain neurons in language models:
‘Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^(2/3)), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^(5/3)).
On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).’
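To spell out the arithmetic behind the quoted exponents, here is a rough sketch. It assumes each forward pass costs compute proportional to n and, as in the quote, that the explainer is the same size as the subject model. The symbols k (forward passes per activation) and D (number of training tokens) are introduced here for illustration and do not appear in the quoted text.

```latex
\begin{align*}
\text{interpretation compute}
  &\approx \underbrace{O(n^{2/3})}_{\text{activations}}
     \times \underbrace{k}_{\text{passes per activation}}
     \times \underbrace{O(n)}_{\text{cost per pass}}
   = O(n^{5/3}) \\
\text{pre-training compute}
  &\approx \underbrace{D}_{\text{tokens}}
     \times \underbrace{O(n)}_{\text{cost per token}}
   = O(n^{2}) \quad \text{when } D \propto n
\end{align*}
```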
Yeah, if you use constant compute to explain each “feature” and the number of features is proportional to model scale, this is only O(n^2), which is the same as training compute.
However, it seems plausible to me that you actually need to look at interactions between features, and so you end up with O(log(n) n^2) or even O(n^3).
Also, constant factors can easily destroy you here.
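To get a feel for how far apart these regimes are, here is a minimal sketch that divides each scaling law mentioned in this thread by the O(n^2) pre-training cost. The function name `relative_cost`, the regime labels, and the `constant` parameter are all hypothetical and only there for illustration. The numbers ignore every real constant factor, so they show trends, not estimates.

```python
import math

# Scaling laws mentioned in the thread, as functions of parameter count n.
# Labels are illustrative; real constant factors are unknown and omitted.
REGIMES = {
    "n^(5/3)": lambda n: n ** (5 / 3),        # constant passes per activation
    "n^2": lambda n: n ** 2,                  # constant passes per feature, features ~ n
    "log(n) n^2": lambda n: math.log(n) * n ** 2,
    "n^3": lambda n: n ** 3,
}


def relative_cost(n: float, regime: str, constant: float = 1.0) -> float:
    """Interpretation compute divided by pre-training compute (taken as n^2).

    `constant` stands in for the unknown constant factor on the
    interpretation side; it is an assumption, not a measured value.
    """
    pretraining = n ** 2
    return constant * REGIMES[regime](n) / pretraining


for n in (1e8, 1e10, 1e12):
    ratios = {name: relative_cost(n, name) for name in REGIMES}
    print(f"n={n:.0e}", {name: f"{r:.3g}" for name, r in ratios.items()})
```

Under these assumptions the n^(5/3) regime gets cheaper relative to pre-training as n grows and the n^3 regime gets more expensive by a factor of n. Any `constant` larger than 1 shifts every ratio up by the same multiple, which is the point about constant factors.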