Yeah I agree with everything you say; it’s just that I was trying to remind myself of enough of SLT to give a ‘five-minute pitch’ for SLT to other people, and I didn’t like the idea that I was hanging it off the ReLU.
I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.
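To make that symmetry concrete (this is the standard textbook example, not something from the post): for a one-hidden-layer network

```latex
f(x;\theta) = \sum_{i=1}^{H} a_i\,\sigma(w_i \cdot x),
\qquad \theta = (a_1, w_1, \dots, a_H, w_H),
```

swapping $(a_i, w_i) \leftrightarrow (a_j, w_j)$ leaves $f$ unchanged, so the parameter-to-function map is many-to-one. At parameters where two hidden units coincide, $(a_i, w_i) = (a_j, w_j)$, the permutation fixes the point, and it is at loci like these that the Fisher information matrix degenerates and the model becomes singular.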
I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I’m interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions along the lines of ‘oh well, the posterior is probably-sorta-Gaussian’ played a big role in its longevity as an idea.
Yeah, it’s not totally clear what this ‘nearly singular’ thing would mean. Intuitively, it might be that there’s a kind of ‘hidden singularity’ in the space of the model that affects its behaviour, like the singularity in a dynamical model with a phase transition, but I’m just guessing.
Maybe this is on us for not including enough detail in the post, but I’m pretty confident that you would lose your bet no matter how you operationalised it. We compared ITO to using the encoder to pick features (taking the top k), then optimising the weights on those features at inference time, and also to learning a post hoc scale to address the ‘shrinkage’ problem where the encoder systematically underweights features. Gradient pursuit consistently outperformed both, so I think it doesn’t just fiddle around with low weights; it also chooses features ‘better’.
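For concreteness, here is a minimal sketch of that first baseline (names and shapes are my own, and I’m using a least-squares refit as one way to do the inference-time weight optimisation; it’s an illustration, not our actual code):

```python
import numpy as np

def topk_then_refit(x, enc_acts, D, k):
    """Keep the k most active encoder features, then refit just
    those weights at inference time.

    x: (d,) activation vector to decompose
    enc_acts: (n_features,) encoder activations for x
    D: (n_features, d) decoder directions (one row per feature)
    """
    # indices of the k largest encoder activations
    idx = np.argsort(enc_acts)[-k:]
    Ds = D[idx]                                   # (k, d) selected rows
    # least-squares refit of the selected weights: Ds.T @ w_s ~ x
    w_s, *_ = np.linalg.lstsq(Ds.T, x, rcond=None)
    w = np.zeros(D.shape[0])
    w[idx] = w_s
    return w
```

The point of the comparison was that this keeps the encoder’s feature *choices* and only fixes the weights, whereas gradient pursuit is free to choose different features.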
With respect to your threshold suggestion: the structure of the specific algorithm we used (gradient pursuit) means that once GP has selected a feature, it tends to assign it quite a high weight, so I don’t think thresholding would do much. SAE encoders tend to have many more features close to zero, because it’s structurally hard for them to avoid this. I would almost turn your argument around: I think that low-activating features in a normal SAE are unlikely to be particularly interesting or interpretable either, as the structure of an SAE makes it difficult to avoid features that activate spuriously due to interference.
One slightly weird quirk of gradient pursuit is that it will almost always choose a new feature which is orthogonal to the span of the features selected so far, which does seem a little artificial.
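In case the loop structure isn’t familiar, here is a minimal NumPy sketch of gradient pursuit (greedy atom selection plus an exact line search along the gradient restricted to the selected support); the names are my own and this is a bare-bones illustration of the algorithm, not our implementation:

```python
import numpy as np

def gradient_pursuit(x, D, k):
    """Greedy sparse decomposition of x over dictionary D.

    x: (d,) vector to decompose
    D: (n_features, d) dictionary, one feature direction per row
    k: number of selection steps
    """
    w = np.zeros(D.shape[0])
    support = set()
    for _ in range(k):
        r = x - D.T @ w                 # current residual
        corr = D @ r                    # correlation of each feature with r
        support.add(int(np.argmax(np.abs(corr))))
        s = sorted(support)
        # descent direction: gradient restricted to the selected support
        g = np.zeros_like(w)
        g[s] = corr[s]
        Dg = D.T @ g
        denom = Dg @ Dg
        step = (r @ Dg) / denom if denom > 0 else 0.0   # exact line search
        w = w + step * g
    return w
```

Because the residual ends up (near-)orthogonal to the directions already in play after each update, the correlation step naturally favours features outside the current span, which is the quirk described above.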
Whether the way it chooses features is actually better for interpretability is difficult to say. As we say in the post, we did manually inspect some examples and couldn’t spot any obvious problems with the ITO decomposition, but we haven’t done a properly systematic, double-blind comparison of the interpretability of ITO versus encoder ‘explanations’, because that would be quite expensive for us in terms of time.
I think it’s too early to say whether ITO is ‘really’ helping or not, but I am pretty confident it’s worth more exploration, which is why we are spreading the word about this specific algorithm in this snippet (even though we didn’t invent it). I think training models with GP at train time, getting rid of the SAE framework altogether, is also worth exploring, to be honest. But at the moment it’s still quite hard to give sparse decompositions an ‘interpretability score’ that is objective and not too expensive to compute, so it’s a bit difficult to see how we would evaluate something like this. (I think auto-interp could be a reasonable way of screening ideas like this once we are running it more easily.)
I think there is a fairly reasonable theoretical argument that non-SAE decompositions won’t work well under superposition (because the NN can’t actually be using an iterative algorithm to read features), but to be honest I haven’t seen any empirical evidence that this is either true or false, and I don’t think we should rule out that non-SAE methods would just work much better; iterative methods work much better for almost every other sparse optimisation problem afaik.