Really cool stuff! Evaluating SAEs based on the rate-distortion tradeoff is an extremely sensible thing to do, and I hope to see this used in future research.
One minor question/idea: have you considered quantizing different features’ activations differently? For example, one might imagine that some features are effectively binary (i.e. the feature is either on or off) while others’ activations might be used by the model in a fine-grained way. Quantizing different features differently would be a way to exploit this to reduce the entropy. (Of course, performing this optimization and distributing bits between different features seems pretty non-trivial, but maybe a greedy approach would work decently enough: tentatively remove a bit from each feature in turn, commit to the removal that increases the loss the least, and repeat.)
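Concretely, the greedy loop I have in mind might look something like the sketch below. Everything here is hypothetical scaffolding: `eval_loss` stands in for whatever downstream loss you care about (reconstruction or cross-entropy), and uniform quantization is just the simplest choice.

```python
import numpy as np

def quantize(acts, bits, max_val):
    """Uniformly quantize nonnegative activations to 2**bits levels."""
    if bits == 0:
        return np.zeros_like(acts)
    step = max_val / (2 ** bits)
    return np.clip(np.round(acts / step), 0, 2 ** bits - 1) * step

def greedy_bit_allocation(acts, eval_loss, init_bits=8, target_total_bits=None):
    """Start every feature at init_bits, then repeatedly drop one bit from
    whichever feature hurts the loss least, until the bit budget is met.

    acts:      (n_samples, n_features) float array of feature activations
    eval_loss: callable mapping quantized activations to a scalar loss
    """
    n_features = acts.shape[1]
    bits = np.full(n_features, init_bits)
    max_vals = acts.max(axis=0) + 1e-8
    if target_total_bits is None:
        target_total_bits = init_bits * n_features // 2

    def quantized(bits_vec):
        out = np.empty_like(acts)
        for j in range(n_features):
            out[:, j] = quantize(acts[:, j], bits_vec[j], max_vals[j])
        return out

    while bits.sum() > target_total_bits:
        best_j, best_loss = None, np.inf
        for j in np.flatnonzero(bits):  # features with bits left to give
            trial = bits.copy()
            trial[j] -= 1
            trial_loss = eval_loss(quantized(trial))
            if trial_loss < best_loss:
                best_j, best_loss = j, trial_loss
        bits[best_j] -= 1  # commit the cheapest removal
    return bits
```

(Each removal step re-evaluates the loss once per feature, so for a real SAE you’d presumably want to batch those trial evaluations or remove several bits at a time.)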
Another minor question: do the rate-distortion curves of different SAEs intersect? I.e. is it the case that some SAE A achieves a lower loss than SAE B at a low bitrate, but then at a high bitrate, SAE B is better than SAE A? If so, then this might suggest a way to infer hierarchies of features from a set of SAEs: use SAE A to get low-resolution information about your input, and then use SAE B for the high-res detailed information.
Putting these questions aside, this is an area of research that I am extremely interested in, so if you are still working on this or have any new cool results, I would love to see.
On the question of quantizing different feature activations differently: Computing the description length using the entropy of a feature activation’s probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.
In our methodology, the effective float precision matters because it sets the bin width for the histogram of a feature’s activations that is then used to compute the entropy. We used the same effective float precision for all features, found by rounding activations to different precisions until the reconstruction or cross-entropy loss changed by more than a set amount.
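As a rough sketch of this procedure (not our actual code; `entropy_bits`, `effective_precision`, `loss_fn`, and the tolerance `tol` are all stand-ins for the checks described above):

```python
import numpy as np

def entropy_bits(acts, bin_width):
    """Entropy (in bits) of a feature's activations after rounding to a
    fixed bin width; this is the idealized per-activation code length."""
    binned = np.round(acts / bin_width)
    _, counts = np.unique(binned, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def effective_precision(acts, loss_fn, tol):
    """Coarsest bin width whose rounding shifts the loss by less than tol.
    loss_fn is a hypothetical stand-in for reconstruction or CE loss."""
    base = loss_fn(acts)
    for w in [2.0 ** -k for k in range(0, 16)]:  # coarse to fine
        if abs(loss_fn(np.round(acts / w) * w) - base) < tol:
            return w
    raise ValueError("no tested precision met the tolerance")

# A near-binary feature stays under one bit; a spread-out one does not.
rng = np.random.default_rng(0)
binary_feat = rng.choice([0.0, 1.0], size=10_000, p=[0.7, 0.3])
spread_feat = rng.gamma(shape=2.0, scale=1.0, size=10_000)
print(entropy_bits(binary_feat, 0.1))  # ~0.88 bits
print(entropy_bits(spread_feat, 0.1))  # ~5-6 bits
```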
> Computing the description length using the entropy of a feature activation’s probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.
Yep, that’s completely true. Thanks for the reminder!
> have you considered quantizing different features’ activations differently?
Stay tuned for our upcoming work 👀
> do the rate-distortion curves of different SAEs intersect? I.e. is it the case that some SAE A achieves a lower loss than SAE B at a low bitrate, but then at a high bitrate, SAE B is better than SAE A? If so, then this might suggest a way to infer hierarchies of features from a set of SAEs: use SAE A to get low-resolution information about your input, and then use SAE B for the high-res detailed information.
This is an interesting perspective. My initial hypothesis before reading your comment was that allowing variable bitrates within a single SAE would get around this issue, but I agree that this would be interesting to test and is something we should definitely check!
With the constant-bitrate version, I do expect that we would see something like this, though we haven’t rigorously tested that hypothesis.
I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want. Then, if we want something more coarse-grained for a different level of analysis, maybe we should use a nice hierarchical representation within a single SAE (as above)… Or maybe we should switch to Representation Engineering, or to something even more coarse-grained, working at the level of heads etc. Perhaps SAEs don’t have to be all things to all people! I’d be interested to hear any opposing views that we really might want many SAEs at different resolutions, though.*
Thanks for your questions and thoughts; we’re really interested in pushing this further and will hopefully have some follow-up work in the not-too-distant future.
EDIT: *I suspect some of the reason that people want different levels of SAEs is that they accept undesirable feature splitting as a fact of life, and so want to be able to zoom in and out of features which may not be “atomic”. I’m hoping that if we can address the feature-splitting problem, then at least that reason may have somewhat less pull.
> I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want. Then, if we want something more coarse-grained for a different level of analysis, maybe we should use a nice hierarchical representation within a single SAE (as above)...
This seems reasonable enough to me. For what it’s worth, the other main reason I’m particularly interested in whether different SAEs’ rate-distortion curves intersect is that, if they do, comparing two SAEs becomes more difficult: depending on the bitrate you’re evaluating at, SAE A might be better than SAE B or vice versa. On the other hand, if SAE A’s rate-distortion curve is always above SAE B’s, then the answer to “which SAE is better?” doesn’t depend on any hyperparameter (the bitrate, or conversely, the acceptable loss threshold). I imagine that the former case is probably true, in which case heuristics for acceptable loss thresholds or reasonable bitrates will probably be developed. But it’d be really nice if the latter case turned out to be true, so I’m personally curious to see whether it is.
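For what it’s worth, once you’ve sampled both curves, checking which of these cases holds is cheap; something like the following sketch (with `make_curve`, `verdict`, and the measurements all made up for illustration) would do it:

```python
import numpy as np

def make_curve(rates, losses):
    """Linear interpolation through measured (bitrate, loss) points."""
    order = np.argsort(rates)
    rates, losses = np.asarray(rates)[order], np.asarray(losses)[order]
    return lambda r: np.interp(r, rates, losses)

def verdict(curve_a, curve_b, grid):
    """'A' or 'B' if one curve is better (lower loss) everywhere on the
    grid, else 'intersect' (the answer depends on the bitrate)."""
    diff = curve_a(grid) - curve_b(grid)
    if (diff <= 0).all():
        return "A"
    if (diff >= 0).all():
        return "B"
    return "intersect"

# Made-up measurements: A wins at low bitrates, B at high bitrates.
sae_a = make_curve([2, 4, 8, 16], [5.0, 3.0, 2.0, 1.8])
sae_b = make_curve([2, 4, 8, 16], [6.0, 3.5, 1.9, 1.5])
print(verdict(sae_a, sae_b, np.linspace(2, 16, 200)))  # -> "intersect"
```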
Yeah, we hope others take on this approach too!