With the ITO experiments, my first guess would be that reoptimizing the sparse approximation problem is mostly relearning the encoder, but with some extra uninterpretable hacks for low activation levels that happen to improve reconstruction. In other words, I’m guessing that the boost in reconstruction accuracy (and therefore loss recovered) comes mostly not from better recognizing the presence of interpretable features, but from doing fiddly uninterpretable things at low activation levels.
I’m not really sure how to operationalize this into a prediction. Maybe something like: if you pick some small-ish threshold T (maybe like T=3 based on the plot copied below) and round activations less than T down to 0 (for both the ITO encoder and the original encoder), then you’ll no longer see that the ITO encoder outperforms the original one.
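To make the proposed check concrete, here is a minimal sketch of the thresholding comparison. It assumes you already have both sets of activations and the decoder as NumPy arrays, and it uses reconstruction MSE as a stand-in for loss recovered; the function names and array shapes are hypothetical, not taken from the post.

```python
import numpy as np

def threshold_activations(acts, T):
    """Round every activation strictly below T down to 0; keep the rest."""
    return np.where(acts < T, 0.0, acts)

def reconstruction_mse(x, acts, decoder):
    """MSE of reconstructing x (n, d) as acts (n, f) @ decoder (f, d)."""
    return float(np.mean((x - acts @ decoder) ** 2))

def compare_after_threshold(x, acts_encoder, acts_ito, decoder, T=3.0):
    """Apply the same threshold to both codes and report each MSE.

    MSE is an assumed proxy here; the actual comparison of interest
    is loss recovered, which needs the language model itself.
    """
    return {
        "encoder": reconstruction_mse(
            x, threshold_activations(acts_encoder, T), decoder),
        "ito": reconstruction_mse(
            x, threshold_activations(acts_ito, T), decoder),
    }
```

The prediction would then be that the "ito" entry is no longer clearly lower than the "encoder" entry once T is applied.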
Maybe this is on us for not including enough detail in the post, but I’m pretty confident that you would lose your bet no matter how you operationalised it. We did compare ITO against two baselines: using the encoder to pick features (the top k) and then optimising the weights on those features at inference time, and learning a post hoc scale to address the ‘shrinkage’ problem, where the encoder systematically underweights features. Gradient pursuit consistently outperformed both of them, so I think that gradient pursuit doesn’t just fiddle around with low weights; it also chooses features ‘better’.
With respect to your threshold thing: the structure of the specific algorithm we used (gradient pursuit) means that if GP has selected a feature, it tends to assign it quite a high weight, so I don’t think that would do much. SAE encoders tend to have many more features close to zero, because it’s structurally hard for them to avoid doing this. I would almost turn your argument around: I think that low-activating features in a normal SAE are likely to not be particularly interesting or interpretable either, as the structure of an SAE makes it difficult to stop features that suffer interference from activating spuriously.
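For readers who want to picture the two baselines, here is a rough sketch for a single activation vector. It is only one plausible reading: the refit uses unconstrained least squares (so weights can come out negative), and the post hoc scale is a single global scalar. The comment does not specify either detail, so treat both choices, and the function names, as assumptions.

```python
import numpy as np

def topk_refit(signal, encoder_acts, decoder, k):
    """Keep the encoder's k largest activations, then refit their weights.

    signal: (d,), encoder_acts: (f,), decoder: (f, d).
    Assumption: plain least squares on the chosen decoder rows.
    """
    support = np.argsort(encoder_acts)[-k:]
    coef, *_ = np.linalg.lstsq(decoder[support].T, signal, rcond=None)
    weights = np.zeros(len(encoder_acts))
    weights[support] = coef
    return weights

def fit_global_scale(xs, acts, decoder):
    """Scalar s minimising ||xs - s * (acts @ decoder)||^2.

    Assumption: one global scale; a per-feature scale is another
    reasonable reading of 'post hoc scale'.
    """
    x_hat = acts @ decoder
    return float(np.sum(xs * x_hat) / np.sum(x_hat * x_hat))
```

If the encoder shrinks its activations, the fitted scale comes out above 1. Neither baseline changes which features are selected, which is the point of the comparison against gradient pursuit.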
One quirk of gradient pursuit that is a bit weird is that it will almost always choose a new feature which is orthogonal to the span of features selected so far, which does seem a little artificial.
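For anyone unfamiliar with the algorithm, here is a minimal NumPy sketch of a gradient pursuit loop with a non-negativity clip. It is an illustration of the general technique and not necessarily the exact implementation behind the results above; the function name and the unit-norm assumption on dictionary rows are mine.

```python
import numpy as np

def gradient_pursuit(signal, dictionary, k):
    """Greedy sparse code of `signal` using at most k rows of `dictionary`.

    signal: (d,), dictionary: (f, d) with rows assumed unit-norm.
    Returns a non-negative weight vector of shape (f,).
    """
    n_features = dictionary.shape[0]
    weights = np.zeros(n_features)
    selected = np.zeros(n_features, dtype=bool)
    for _ in range(k):
        residual = signal - weights @ dictionary
        inner = dictionary @ residual          # correlation with residual
        idx = int(np.argmax(inner))
        if inner[idx] <= 1e-12:                # nothing left to explain
            break
        selected[idx] = True                   # grow the support by one
        grad = np.where(selected, inner, 0.0)  # gradient on the support
        c = grad @ dictionary                  # its image in signal space
        step = (c @ residual) / (c @ c)        # exact line search
        weights = np.maximum(weights + step * grad, 0.0)
    return weights
```

Each iteration picks the feature most correlated with the current residual and then takes one gradient step on all selected weights, rather than fully re-solving the least-squares problem as orthogonal matching pursuit would.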
Whether its ‘better’ choice of features is actually better for interpretability is difficult to say. As we say in the post, we did manually inspect some examples and we couldn’t spot any obvious problems with the ITO decomposition, but we haven’t done a properly systematic double-blind comparison of ITO to encoder ‘explanations’ in terms of interpretability, because it’s quite expensive for us in terms of time.
I think that it’s too early to say whether ITO is ‘really’ helping or not, but I am pretty confident it’s worth more exploration, which is why we are spreading the word about this specific algorithm in this snippet (even though we didn’t invent it). I think training models using GP at train time, getting rid of the SAE framework altogether, is also worth exploring, to be honest. But at the moment it’s still quite hard to give sparse decompositions an ‘interpretability score’ which is objective and not too expensive to compute, so it’s a bit difficult to see how we would evaluate something like this. (I think auto-interp could be a reasonable way of screening ideas like this once we can run it more easily.)
I think there is a fairly reasonable theoretical argument that non-SAE decompositions won’t work well for superposition (because the NN can’t actually be using an iterative algorithm to read features), but to be honest I haven’t really seen any empirical evidence that this is either true or false. I don’t think we should rule out that non-SAE methods would just work loads better; they do work much better for almost every other sparse optimisation problem afaik.