it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA
It’s not clear to me this is true exactly. As in, suppose I want to explain as much of what a transformer is doing as possible in some fixed amount of time. Would I be better off looking at PCA features or SAE features?
Yes, most/many SAE features are easier to understand than PCA features, but each SAE feature (which is actually sparse) is only a tiny, tiny fraction of what the model is doing. So, it might be that you’d get better interp scores (measured by how much of what the model is doing gets explained) with PCA.
Certainly, if we do literal “fraction of loss explained by human-written explanations”, both PCA and SAE recover approximately 0% of training compute.
I do think you can often learn very specific, more interesting things with SAEs, and for various applications SAEs are more useful, but in terms of some broader understanding, I don’t think SAEs are clearly “better” than PCA. (There are also various cases where PCA on some particular distribution is totally the right tool for the job.)
Certainly, I don’t think it has been shown that we can get non-negligible interp scores with SAEs.
To be clear, I do think we learn something from the fact that SAE features seem to often/mostly at least roughly correspond to some human concept, but I think the fact that there are vastly more SAE features vs PCA features does matter! (PCA was never trying to decompose into this many parts.)
Yes, I generally agree with this. I also realized that “interp score” is ambiguous (and the true end-to-end interp score is negligible, I agree), but what’s more clearly true is that SAE features tend to be more interpretable. This might be largely explained by “people tend to think of interpretable features as branches of a decision tree, which are sparsely activating”. Even so, it was surprising to me that the top SAE features are significantly more interpretable than the top PCA features.
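To make the "each sparse feature is only a tiny fraction of what the model is doing" point concrete, here is a toy sketch, not a result about any real transformer. Everything in it is an assumption for illustration: the activations are synthetic, the ground-truth sparse dictionary stands in for a trained SAE, and `variance_explained_by_top_k` is a hypothetical helper. It compares how much activation variance the top k PCA directions capture against the top k sparse feature directions.

```python
import numpy as np


def variance_explained_by_top_k(acts, directions, k):
    """Fraction of total variance captured by the span of the first k directions."""
    centered = acts - acts.mean(axis=0)
    # Orthonormalize so non-orthogonal feature directions are not double counted.
    basis, _ = np.linalg.qr(directions[:k].T)  # shape: (d_model, k)
    projected = centered @ basis
    return float((projected ** 2).sum() / (centered ** 2).sum())


rng = np.random.default_rng(0)
n_samples, d_model, n_features = 2000, 32, 256

# Assumed toy setup: many unit-norm feature directions, each active on ~2% of samples.
feature_dirs = rng.normal(size=(n_features, d_model))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)
active = rng.random((n_samples, n_features)) < 0.02
codes = active * rng.random((n_samples, n_features))
acts = codes @ feature_dirs

# "Top" sparse features: those with the largest total squared activation.
order = np.argsort(-(codes ** 2).sum(axis=0))
top_feature_dirs = feature_dirs[order]

# PCA directions are the right singular vectors of the centered activations.
centered = acts - acts.mean(axis=0)
_, _, pca_dirs = np.linalg.svd(centered, full_matrices=False)

k = 5
pca_frac = variance_explained_by_top_k(acts, pca_dirs, k)
sparse_frac = variance_explained_by_top_k(acts, top_feature_dirs, k)
print(f"top-{k} PCA directions:    {pca_frac:.3f} of variance")
print(f"top-{k} sparse directions: {sparse_frac:.3f} of variance")
```

The top k PCA directions capture at least as much variance as any other k directions by construction, so this only illustrates the coverage side of the tradeoff. It says nothing about which set of features is easier for a human to understand, and variance explained is not the same thing as loss explained.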