Sam Marks comments on How useful is mechanistic interpretability?

Sam Marks 3 Dec 2023 6:09 UTC
3 points
Another metric is the mean max cosine similarity between two dictionaries, where one of the dictionaries is treated as the ground truth. We've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas a dictionary trained on a randomized model and one trained on a non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).
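For concreteness, here is a minimal sketch of how mean max cosine similarity could be computed. It is not the code we actually used. It assumes each dictionary is a list of feature vectors, and that the metric takes, for every feature in the ground-truth dictionary, its best cosine match in the other dictionary, then averages those best matches. The function name and the toy dictionaries are hypothetical.

```python
import math

def _unit(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_max_cosine_similarity(ground_truth, other):
    # Assumed definition: for each ground-truth feature, take the highest
    # cosine similarity to any feature in `other`, then average over
    # the ground-truth features. Note that this is not symmetric.
    gt = [_unit(v) for v in ground_truth]
    ot = [_unit(v) for v in other]
    best = [max(sum(a * b for a, b in zip(g, o)) for o in ot) for g in gt]
    return sum(best) / len(best)

# Toy dictionaries (made up for illustration).
reference = [[1.0, 0.0], [0.0, 1.0]]
reordered = [[0.0, 2.0], [3.0, 0.0]]  # same directions, permuted and rescaled
rotated = [[1.0, 1.0], [1.0, -1.0]]   # every feature is 45 degrees off

same = mean_max_cosine_similarity(reference, reordered)
off = mean_max_cosine_similarity(reference, rotated)
```

Because the vectors are normalized and the max is taken over all features in the other dictionary, the score ignores feature ordering and scale, which is what you want when comparing dictionaries from different seeds.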