Yes, I generally agree with this. I also realized that “interp score” is ambiguous (and I agree that the true end-to-end interp score is negligible), but what’s more clearly true is that SAE features tend to be more interpretable. This might be largely explained by “people tend to think of interpretable features as branches of a decision tree, which are sparsely activating”. Even so, it surprised me that the top SAE features are significantly more interpretable than the top PCA features.