I also find myself wondering whether something like this could be extended to generate the maximally activating text for a feature. In the same way that for vision models it’s useful to see both the training-data examples that activate most strongly and synthetic max-activating examples, it would be really cool to be able to generate synthetic max-activating examples for SAE features.
For vision models this can be approached with gradient descent in pixel space, but the discrete tokenisation of text makes it a very different challenge. I suspect Jessica Rumbelow would have some insights here.
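As a very rough illustration of why the discrete case is harder, here is a sketch of one possible workaround: relax the tokens into a soft distribution over the vocabulary, optimise that distribution with gradient descent to maximise the SAE feature's activation, and only discretise at the end. Everything here (`model.embedding_matrix`, `model.forward_from_embeddings`, `sae.encode`) is a hypothetical interface, not any particular library, and there is no guarantee that the argmax tokens at the end still activate the feature strongly.

```python
# Hypothetical sketch: optimise a soft one-hot distribution over the vocabulary
# (a softmax relaxation) to maximise a single SAE feature's activation.
# `model`, `sae`, and `feature_idx` are assumed to exist; shapes are illustrative.
import torch
import torch.nn.functional as F

def synthesize_max_activating_text(model, sae, feature_idx, seq_len=16,
                                   steps=500, lr=0.1, tau=1.0):
    vocab_size, d_model = model.embedding_matrix.shape  # assumed attribute
    # Learnable logits over the vocabulary at each position.
    token_logits = torch.randn(seq_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([token_logits], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Soft token distribution -> weighted mix of token embeddings.
        probs = F.softmax(token_logits / tau, dim=-1)
        soft_embeds = probs @ model.embedding_matrix            # (seq_len, d_model)
        resid = model.forward_from_embeddings(soft_embeds)      # assumed hook point
        acts = sae.encode(resid)                                # (seq_len, n_features)
        loss = -acts[:, feature_idx].max()                      # maximise peak activation
        loss.backward()
        opt.step()

    # Discretise: take the argmax token at each position.
    return token_logits.argmax(dim=-1)
```

The gap between the relaxed optimum and any actual token sequence is exactly where this gets hard, which is presumably why the vision-model recipe doesn't transfer cleanly.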
My main insight from all this is that we should be thinking in terms of a taxonomy of features. Some are very token-specific, others are more nuanced and context-specific (in a variety of ways). The challenge of finding maximally activating text samples might be very different from one category of features to another.
Joseph and Johnny did some interesting work on this in ‘Understanding SAE Features with the Logit Lens’, taxonomising features into partition, suppression, and prediction features, and using summary statistics to distinguish them.
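As a quick sketch of what the summary-statistics side of that could look like (my own reconstruction, not their code): project each SAE decoder direction through the unembedding to get its per-token logit effects, then compute distributional statistics such as skewness and kurtosis over the vocabulary. `W_dec` and `W_U` are assumed tensor names, and the choice of these particular statistics follows my reading of the post rather than a verified reproduction.

```python
# Rough sketch of logit-lens summary statistics for SAE features.
# W_dec: (n_features, d_model) SAE decoder directions (assumed).
# W_U:   (d_model, vocab) unembedding matrix (assumed).
import torch
from scipy.stats import skew, kurtosis

def logit_lens_summary(W_dec: torch.Tensor, W_U: torch.Tensor):
    # Each feature's effect on the output logits, per vocabulary token.
    logit_weights = (W_dec @ W_U).detach().cpu().numpy()  # (n_features, vocab)
    return {
        "skew": skew(logit_weights, axis=1),          # long positive tail: boosts a few tokens?
        "kurtosis": kurtosis(logit_weights, axis=1),  # heavy tails: concentrated logit effect
    }
```

Statistics like these seem like a natural starting point for the kind of taxonomy I'm gesturing at above, since token-specific and context-specific features should leave quite different footprints on the logits.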