Dmitry Vaintrob comments on A bird’s eye view of ARC’s research

Dmitry Vaintrob 27 Oct 2024 2:45 UTC
LW: 2 AF: 1
Yes, I generally agree with this. I also realized that “interp score” is ambiguous (and I agree that the true end-to-end interp score is negligible), but what’s more clearly true is that SAE features tend to be more interpretable. This might be largely explained by “people tend to think of interpretable features as branches of a decision tree, which are sparsely activating”. But it was also surprising to me that the top SAE features are significantly more interpretable than the top PCA features.