Ah yes! I tried doing exactly this to produce a sort of ‘logit lens’ to explain the SAE features. In particular, I tried the following (rough code sketch after the list):
1. Take an SAE feature encoder direction and map it directly into the multimodal space to get an embedding.
2. Pass each of the ImageNet text prompts “A photo of a {label}.” through the CLIP text model to generate a multimodal embedding for each ImageNet class.
3. Calculate the cosine similarities between the SAE embedding and the ImageNet class embeddings, and pass these through a softmax to get a probability distribution over classes.
4. Look at the ImageNet labels with high probability; these should give some explanation of what the SAE feature is representing.
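For concreteness, here is roughly what that pipeline looks like. This is a minimal sketch, not the exact code I used: it assumes the SAE is trained on the residual stream of CLIP’s vision transformer, uses the HuggingFace openai/clip-vit-base-patch32 checkpoint, maps the encoder direction through CLIP’s final layer norm and visual projection, and truncates the ImageNet label list for illustration.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Illustrative checkpoint; swap in whichever CLIP model the SAE was trained on.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical SAE encoder direction for one feature, living in the vision
# transformer's residual stream. In practice this would be the relevant
# column of the trained SAE's W_enc rather than a random vector.
sae_direction = torch.randn(model.config.vision_config.hidden_size)

# Map the direction into the shared multimodal space the same way CLIP maps
# image features: final layer norm followed by the visual projection.
# (Whether to apply the layer norm to a raw direction is itself a judgement call.)
with torch.no_grad():
    feature_embed = model.visual_projection(
        model.vision_model.post_layernorm(sae_direction)
    )
    feature_embed = feature_embed / feature_embed.norm()

# Embed the ImageNet class prompts. imagenet_labels is assumed to be the full
# list of 1000 class names; truncated here for illustration.
imagenet_labels = ["tench", "goldfish", "great white shark"]
prompts = [f"A photo of a {label}." for label in imagenet_labels]
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    class_embeds = model.get_text_features(**tokens)
    class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)

# Cosine similarities -> softmax over classes (CLIP's learned temperature is
# one reasonable choice of scale before the softmax).
logits = model.logit_scale.exp() * (class_embeds @ feature_embed)
probs = logits.softmax(dim=-1)

# Print the highest-probability classes as a candidate explanation of the feature.
k = min(5, len(imagenet_labels))
top_probs, top_idx = probs.topk(k)
for p, i in zip(top_probs.tolist(), top_idx.tolist()):
    print(f"{imagenet_labels[i]}: {p:.3f}")
```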
Surprisingly, this did not work at all! I only spent a small amount of time trying to get it to work (less than a day), so I’m planning to try again. If I remember correctly, I also tried the same analysis with the decoder feature vector, and tried shifting by the decoder bias vector as well; neither of these seemed to provide good ImageNet class explanations of the SAE features. I will try this again and let you know how it goes!
Awesome work! I couldn’t find anywhere that specified how sparse your SAE is (e.g. its L0). I would be interested to hear what L0 you got!