My main takeaway from all this is that we should be thinking in terms of a taxonomy of features. Some are very token-specific; others are more nuanced and context-dependent (in a variety of ways). The challenge of finding maximally activating text samples may differ considerably from one category of features to another.
Joseph and Johnny did some interesting work on this in ‘Understanding SAE Features with the Logit Lens’, taxonomizing features as partition features vs suppression features vs prediction features, and using summary statistics to distinguish them.
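To make the "summary statistics" idea concrete, here is a minimal sketch of how one might summarise a feature's logit-lens distribution. It is not the authors' code. It assumes the statistics of interest are skewness and excess kurtosis of the feature's logit weights (its decoder direction projected through the unembedding), and it uses random matrices as stand-ins for a real model and SAE. The function name `logit_weight_stats` and all shapes are hypothetical.

```python
import numpy as np


def logit_weight_stats(decoder_direction, unembed):
    """Project a feature's decoder direction through the unembedding
    and summarise the resulting distribution over vocabulary tokens."""
    logit_weights = decoder_direction @ unembed  # shape: (vocab,)
    centered = logit_weights - logit_weights.mean()
    z = centered / centered.std()
    return {
        "skewness": float((z ** 3).mean()),
        "excess_kurtosis": float((z ** 4).mean() - 3.0),
    }


# Stand-in data (assumption): a random unembedding, not a real model's weights.
rng = np.random.default_rng(0)
d_model, vocab = 256, 5000
unembed = rng.normal(size=(d_model, vocab))

# A diffuse feature: a random direction with no preferred token.
diffuse = rng.normal(size=d_model)

# A token-specific feature: aligned with a single token's unembedding column.
token_specific = unembed[:, 123] / np.linalg.norm(unembed[:, 123])

stats_diffuse = logit_weight_stats(diffuse, unembed)
stats_token = logit_weight_stats(token_specific, unembed)

print("diffuse:       ", stats_diffuse)
print("token-specific:", stats_token)
```

In this toy setup the token-specific direction produces one large outlier in the logit weights, so its kurtosis is much higher than that of the diffuse direction. Whether the same statistics cleanly separate categories on a real SAE is an empirical question.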