ryan_greenblatt comments on How useful is mechanistic interpretability?

ryan_greenblatt 3 Dec 2023 3:49 UTC
3 points
0
(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single-token features in the randomized transformer, and when you ignore these features (or correct for them in some other way I’m forgetting?), the SAE on an actual transformer does indeed have higher correlation.)
- Sam Marks 3 Dec 2023 6:09 UTC
  3 points
  0
  Another metric is mean max cosine similarity, which compares two dictionaries by treating one of them as the ground truth. We’ve found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<.3 IIRC, but I don’t have the data on hand).
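
For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute mean max cosine similarity. It assumes each dictionary is a matrix whose rows are feature directions, and that the average is taken over the features of the dictionary treated as ground truth; the comment does not say which direction was used, so that choice is an assumption. The function name and the toy dictionaries are hypothetical and use synthetic data only, not anything from the experiments described above.

```python
import numpy as np


def mean_max_cosine_similarity(reference, candidate):
    """For each row of `reference`, find its best cosine match in `candidate`,
    then average those best matches.

    Assumption: rows are feature directions, and `reference` plays the role
    of the ground-truth dictionary. The metric is not symmetric in general.
    """
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    cand = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    cosines = ref @ cand.T              # shape (n_reference, n_candidate)
    return float(cosines.max(axis=1).mean())


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
dict_a = rng.normal(size=(64, 32))

# Same features as dict_a, but reordered and slightly perturbed.
dict_b = dict_a[rng.permutation(64)] + 0.01 * rng.normal(size=(64, 32))

# An unrelated dictionary drawn independently.
dict_c = rng.normal(size=(64, 32))

score_same = mean_max_cosine_similarity(dict_a, dict_b)
score_unrelated = mean_max_cosine_similarity(dict_a, dict_c)
print(score_same, score_unrelated)
```

Because the max is taken over all candidate features, the score does not depend on the order in which either dictionary stores its features, which is why two runs with different seeds can be compared directly.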