Super cool! Some miscellaneous questions and comments as I go through it:
I see that the trees you show are using the encoded vector? What’s been your motivation for that? How do the encoded and decoded vectors tend to differ in your experience? Do you see them as meaning somewhat different things? I guess for a perfect SAE (with 0 reconstruction loss) they’d be identical, is that correct?
‘Layer 6 SAE feature 17’, ‘This feature is activated by references to making short statements or brief remarks’
This seems pretty successful to me, since the top results are about short stories / speeches.
The parts of the definition tree that don’t fit that seem similar to the ‘hedging’ sorts of definitions that you found in the semantic void work, eg ‘a group of people who are...’. I wonder whether there might be some way to filter those out and be left with the definitions more unique to the feature.
‘Layer 10 SAE feature 777’, ‘But the lack of numerical tokens was surprising’. This seems intuitively unsurprising to me—presumably the feature doesn’t activate on every instance of a number, even a number in a relevant range (eg ‘94’), but only when the number is in a context that makes it likely to be a year. So just the token ‘94’ on its own won’t be that close to the feature direction. That seems like a key downside of this method, that it gives up context sensitivity (method 1 seems much stronger to me for this reason).
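If useful, a quick check along these lines could quantify that intuition (just a sketch; `tokenizer`, `embedding_matrix` and `feature_dir` are placeholders for however the model and SAE feature direction happen to be loaded, not anything from the post):

```python
import torch

# Sketch only: `tokenizer`, `embedding_matrix` and `feature_dir` are placeholders
# for however the model and SAE feature direction happen to be loaded.
def token_feature_similarity(token_str, tokenizer, embedding_matrix, feature_dir):
    """Cosine similarity between a lone token's embedding and the feature direction."""
    token_id = tokenizer.encode(token_str, add_special_tokens=False)[0]
    token_emb = embedding_matrix[token_id]
    return torch.nn.functional.cosine_similarity(token_emb, feature_dir, dim=0).item()

# e.g. compare a bare "94" against tokens that carry year-context on their own,
# like "1994"; a much lower score for "94" would support the point above.
```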
‘It’s not clear why (for example) some features require larger scaling factors to produce relevant trees and/or lists’. It would be really interesting to look for some value that gets maximized or minimized at the optimum scaling distance, although nothing’s immediately jumping out at me.
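One crude thing that could be swept, for what it’s worth, is the entropy of the next-token distribution at the scaled point (purely a sketch; `logits_at_scaled_point` is a hypothetical stand-in for however the embedding substitution is set up):

```python
import torch

# Sketch only: `logits_at_scaled_point` is a hypothetical stand-in for whatever
# returns next-token logits when (scale * feature_dir) is substituted as an embedding.
def entropy_vs_scale(logits_at_scaled_point, feature_dir, scales):
    """Sweep the scaling factor and record the entropy of the next-token distribution."""
    results = []
    for scale in scales:
        logits = logits_at_scaled_point(scale * feature_dir)   # shape: (vocab_size,)
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        results.append((scale, entropy))
    return results
```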
‘Improved control integration: Merging common controls between the two functionalities for streamlined interaction.’ Seems like it might be worth fully combining them, so that the output is always showing both, since the method 2 output doesn’t take up that much room.
Really fascinating stuff, I wonder whether @Johnny Lin would have any interest in making it possible to generate these for features in Neuronpedia.
I tried both encoder- and decoder-layer weights for the feature vector; they usually seem to work equally well, but you need to set the scaling factor (and, for the list method, the numerator exponent) differently.
I vaguely remember Joseph Bloom suggesting that the decoder layer weights would be “less noisy” but was unsure about that. I haven’t got a good mental model for how they differ. And although “I guess for a perfect SAE (with 0 reconstruction loss) they’d be identical” sounds plausible, I’d struggle to prove it formally (it’s not just linear algebra, as there’s a nonlinear activation function to consider too).
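It is at least easy to check empirically how close the two directions are for any given feature. Rough sketch below; the `W_enc`/`W_dec` attribute names and their index layout are just my assumptions about a typical SAE implementation, not any particular library’s API:

```python
import torch

# Sketch only: the W_enc / W_dec attribute names and their index layout are my
# assumptions about a typical SAE implementation, not any particular library's API.
def feature_directions(sae, feature_idx):
    """Unit-normalised encoder- and decoder-derived directions for one SAE feature."""
    enc_dir = sae.W_enc[:, feature_idx]   # how strongly residual-stream inputs excite the feature
    dec_dir = sae.W_dec[feature_idx, :]   # the direction the feature writes back when it fires
    return enc_dir / enc_dir.norm(), dec_dir / dec_dir.norm()

def encoder_decoder_agreement(sae, feature_idx):
    """Cosine similarity between the two directions: 1.0 means they coincide exactly."""
    enc_dir, dec_dir = feature_directions(sae, feature_idx)
    return torch.dot(enc_dir, dec_dir).item()
```

Averaging that over all features would at least show how far the two choices tend to diverge in practice.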
I like the idea of pruning the generic parts of trees. Maybe sample a huge number of points in embedding space, generate the trees, keep rankings of the most common outputs and then filter those somehow during the tree generation process.
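Something like this is what I have in mind (a sketch only; `generate_definitions` is a hypothetical stand-in for the tree-generation step, taking a point in embedding space and returning its definitions):

```python
import torch
from collections import Counter

# Sketch only: `generate_definitions` is a hypothetical stand-in for the
# tree-generation step, taking an embedding-space point and returning its definitions.
def generic_definition_counts(generate_definitions, d_model, n_samples=1000, scale=5.0, seed=0):
    """Count which definitions recur across randomly sampled points in embedding space."""
    torch.manual_seed(seed)
    counts = Counter()
    for _ in range(n_samples):
        direction = torch.randn(d_model)
        direction = direction / direction.norm()
        for definition in generate_definitions(scale * direction):
            counts[definition] += 1
    return counts

def prune_generic(definitions, counts, n_samples=1000, threshold=0.05):
    """Drop any definition that shows up for more than `threshold` of random points."""
    return [d for d in definitions if counts[d] / n_samples <= threshold]
```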
Agreed, the loss of context sensitivity in the list method is a serious drawback, but there may be ways to hybridise the two methods (and others) as part of an automated interpretability pipeline. There are plenty of SAE features where context isn’t really an issue, it’s just like “activates whenever any variant of the word ‘age’ appears”, in which case a list of tokens captures it easily (and the tree of definitions is arguably confusing matters, despite being entirely relevant to the feature).
I also find myself wondering whether something like this could be extended to generate the maximally activating text for a feature. In the same way that for vision models it’s useful to see both the training-data examples that activate most strongly and synthetic max-activating examples, it would be really cool to be able to generate synthetic max-activating examples for SAE features.
In vision models it’s possible to approach this with gradient descent. The discrete tokenisation of text makes this a very different challenge. I suspect Jessica Rumbelow would have some insights here.
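A crude baseline, sidestepping the gradient question entirely, would be a greedy token-by-token search (sketch only, not a serious proposal; `activation_fn` is a hypothetical hook that returns the SAE feature’s activation at the final position for a batch of token sequences, and the tokenizer is assumed HuggingFace-style):

```python
import torch

# Sketch only: a crude greedy baseline. `activation_fn` is a hypothetical hook
# returning the SAE feature's activation at the final position for a batch of
# token sequences; the tokenizer is assumed to be HuggingFace-style.
@torch.no_grad()
def greedy_max_activating(tokenizer, activation_fn, seq_len=8, vocab_batch=512):
    """Greedily append whichever token most increases the feature's activation."""
    vocab_size = tokenizer.vocab_size
    tokens = torch.tensor([[tokenizer.bos_token_id]])
    for _ in range(seq_len):
        best_score, best_token = -float("inf"), None
        for start in range(0, vocab_size, vocab_batch):
            candidates = torch.arange(start, min(start + vocab_batch, vocab_size))
            batch = torch.cat([tokens.repeat(len(candidates), 1), candidates[:, None]], dim=1)
            scores = activation_fn(batch)          # shape: (len(candidates),)
            top_score, top_idx = scores.max(dim=0)
            if top_score.item() > best_score:
                best_score, best_token = top_score.item(), candidates[top_idx].item()
        tokens = torch.cat([tokens, torch.tensor([[best_token]])], dim=1)
    return tokenizer.decode(tokens[0])
```

Something gradient-informed would presumably do much better, but even the greedy version might be informative for the more token-specific features.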
My main insight from all this is that we should be thinking in terms of taxonomisation of features. Some are very token-specific, others are more nuanced and context-specific (in a variety of ways). The challenge of finding maximally activating text samples might be very different from one category of features to another.
Joseph and Johnny did some interesting work on this in ‘Understanding SAE Features with the Logit Lens’, taxonomizing features as partition features vs suppression features vs prediction features, and using summary statistics to distinguish them.