On the paper “Towards Monosemanticity: Decomposing Vision Models with Dictionary Learning”, looking at the images for feature 8, I think it’s not just “pointy objects with a metallic aspect”, but “a serried array of pointy objects with a metallic aspect”. So I predict that the Iron Throne from Game of Thrones should trigger it.
That suggests a way to test the verbal descriptions: take a verbal description, feed it to GPT-4V along with a random sample of images from ImageNet, ask which images fit the description, and see how well that classifier matches the behavior of the feature (with a hyperparameter for the feature's activation threshold). Or ask for a 0-4 score and plot that against the feature activation.
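As a rough sketch of what that evaluation might look like: suppose we have collected a VLM's 0-4 "fits the description" scores for a set of images alongside the feature's activations on those same images. The `scores`/`activations` inputs and the agreement metric below are my own illustrative choices, not anything from the paper; the actual GPT-4V querying step is omitted.

```python
# Hypothetical evaluation of a verbal feature description:
# compare VLM 0-4 scores against feature activations.
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation between VLM scores and activations.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_threshold_agreement(scores, activations, score_cut=2):
    # Sweep the activation threshold (the hyperparameter mentioned
    # above) and report the best fraction of images on which
    # "VLM says it fits" (score >= score_cut) agrees with
    # "feature fires" (activation >= threshold).
    labels = [s >= score_cut for s in scores]
    best = 0.0
    for t in sorted(set(activations)):
        preds = [a >= t for a in activations]
        agree = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        best = max(best, agree)
    return best

# Toy data standing in for real VLM scores and feature activations.
scores = [4, 3, 0, 1, 4, 0]
activations = [0.9, 0.7, 0.1, 0.2, 0.8, 0.0]
print(pearson(scores, activations))             # correlation of score vs. activation
print(best_threshold_agreement(scores, activations))  # best classifier agreement
```

A high correlation (or near-1.0 thresholded agreement) would suggest the verbal description is a good summary of what the feature actually responds to.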
Overall, the papers you link to include several impressive and interesting ones — well worth reading.