I ran a check by tallying the “More Safety Relevant Features” from the 1M SAE to see whether they recur in the 34M SAE (in some related form).
I don’t think we can interpret their list of safety-relevant features as exhaustive. I’d bet (80% confidence) that we could find 34M features corresponding to at least some of the 1M features you listed, given access to their UMAP browser. Unfortunately we can’t do this without Anthropic support.
Non-exhaustiveness seems plausible, but then I’m curious how they found these features. They don’t seem to be constrained to an index range, and there are nicely matched pairs like this one, which I don’t think random checking would turn up: