I think there’s something more general to the argument (related to SAEs seeming somewhat overkill in many ways, for strictly safety purposes).
For SAEs, the computational complexity would likely be on the same order as full pretraining; e.g. from Mapping the Mind of a Large Language Model:
‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).’
While for activation steering approaches, the computational complexity should probably be similar to this ‘Computational requirements’ section from Language models can explain neurons in language models:
‘Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^(2/3)), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^(5/3)).
On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).’
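To make the quoted scaling concrete, here is a rough back-of-envelope sketch (mine, not from either post) that just plugs the stated exponents into code; the per-activation pass count and all constant factors are placeholder assumptions:

```python
# Back-of-envelope comparison of the quoted scaling laws (illustrative only):
#   activations to interpret ~ n^(2/3)
#   explanation compute      ~ n^(5/3)  (constant passes per activation,
#                                         same-size explainer, ~n per pass)
#   pretraining compute      ~ n^2      (data scaling linearly with parameters)

def activations(n: float) -> float:
    """Activations to interpret, ~ O(n^(2/3))."""
    return n ** (2 / 3)

def explanation_compute(n: float, passes_per_activation: int = 10) -> float:
    """Explainer compute, ~ O(n^(5/3)): each activation gets a constant
    number of forward passes, each costing ~n when the explainer is the
    same size as the subject model. The pass count here is a made-up constant."""
    return passes_per_activation * activations(n) * n

def pretraining_compute(n: float) -> float:
    """Pretraining compute, ~ O(n^2) if tokens scale linearly with n."""
    return n ** 2

for n in (1e9, 1e10, 1e11, 1e12):
    ratio = explanation_compute(n) / pretraining_compute(n)
    print(f"n = {n:.0e}: explanation / pretraining ≈ {ratio:.1e}")
```

Under these assumptions the ratio shrinks as n^(-1/3), which is the sense in which the quoted approach is 'perhaps favorable compared to pre-training itself'; the conclusion hinges on the per-activation pass count actually staying constant.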
Yeah, if you use a constant number of forward passes (each costing O(n)) to explain each “feature” and the number of features is proportional to model scale, this is only O(n^2), which is the same as training compute.
However, it seems plausible to me that you actually need to look at interactions between features, so you end up with O(n^2 log(n)) (if each feature has ~log(n) relevant interactions) or even O(n^3) (if you check all pairs).
Also constant factors can easily destroy you here.
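As a minimal illustration of how the interaction question changes the totals, here is a hypothetical cost model (my own, purely illustrative) assuming ~n features and one O(n) forward pass per check:

```python
# Hypothetical cost model for the interaction argument (illustrative only):
#   per-feature explanation:      ~n features * O(n) per check        -> n^2
#   ~log(n) interactions/feature: ~n features * log(n) checks * O(n)  -> n^2 log(n)
#   all pairwise interactions:    ~n^2 feature pairs * O(n) per check -> n^3
import math

def per_feature(n: float) -> float:
    return n * n

def logn_interactions(n: float) -> float:
    return n * math.log(n) * n

def all_pairwise(n: float) -> float:
    return n * n * n

n = 1e12  # hypothetical parameter count
base = per_feature(n)
print(f"log(n)-interactions overhead: {logn_interactions(n) / base:.0f}x")
print(f"pairwise-interactions overhead: {all_pairwise(n) / base:.0e}x")
```

Even the log(n) case is roughly a 30x overhead on top of pretraining-scale compute at n ~ 1e12 in this toy model, before any of those constant factors bite.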