Thanks for copying this over!
For what it’s worth, my current view on SAEs is that they remain a pretty neat unsupervised technique for making (partial) sense of activations, but that they belong in the general category of unsupervised learning techniques, e.g. clustering algorithms, rather than being a method that’s going to discover the “true representational directions” used by the language model. As such, they share many of the pros and cons of unsupervised techniques in general:[1]
(Pros) They may be useful / efficient for getting a first-pass understanding of what’s going on in a model / with some data (indeed many of their success stories have this flavour).
(Cons) They are hit and miss—often not carving up the data in the way you’d prefer, with weird omissions or gerrymandered boundaries you need to manually correct for. Once you have a hypothesis, a supervised method will likely give you better results.
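For concreteness, here’s a minimal sketch of the kind of unsupervised setup I have in mind: a vanilla ReLU SAE with an L1 sparsity penalty, trained on cached activations. The names and dimensions are purely illustrative, and this isn’t meant to match any particular published variant.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla ReLU SAE; illustrative only, not any particular published variant."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        f = torch.relu(self.encoder(acts))  # sparse, non-negative feature code
        recon = self.decoder(f)             # reconstruction of the activations
        return recon, f

def sae_loss(recon, acts, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature code.
    # There are no labels anywhere, which is the sense in which this is
    # unsupervised, much like a clustering algorithm.
    mse = (recon - acts).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Usage, assuming `activations` is a [batch, d_model] tensor cached from the model:
# sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
# recon, f = sae(activations)
# loss = sae_loss(recon, activations, f)
```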
I think this means SAEs could still be useful for generating hypotheses when trying to understand model behaviour, and I really like the CLT papers in this regard.[2] However, it’s still unclear whether they are better for hypothesis generation than alternative techniques, particularly techniques with other advantages, like being usable with limited model access (i.e. black-box techniques) or not requiring a large up-front cost to be paid before they can be used on a model.
I largely agree with your updates 1 and 2 above, although on 2 I still think it’s plausible that, while many “why is the model doing X?” type questions can be answered with black-box techniques today, this may not continue to hold in the future, which is why I still view interp as a worthwhile research direction. It does make it important, though, to always try strong baselines on any new project, and to only get excited when interp sheds light on problems that genuinely seem hard to solve using those baselines.[3]
[1] When I say unsupervised learning, I’m using this term in its conventional sense, e.g. clustering algorithms, manifold learning, etc.; not in the sense of tasks like language model pre-training, which I sometimes see referred to as unsupervised.
[2] Particularly their emphasis on pruning massive attribution graphs, improving the tooling for making sense of the results, and accepting that some manual adjustment of the decompositions produced by CLTs may be necessary, because we’re giving up on the idea that CLTs / SAEs are uncovering a “true basis”.
[3] And it does seem that black-box methods often suffice (in the sense of giving “good enough explanations” for whatever we need these explanations for) when we try to do this. Though this could just be, as you say, because of bad judgement. I’d definitely appreciate suggestions for better downstream tasks we should try!
Sure, I agree that, as we point out in the post, this penalty may not be targeting the right thing, or could be targeting it in the wrong way. We shared it more as a proof of concept that others might like to build on, and we don’t claim it’s a superior solution to standard JumpReLU training.
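For reference, here’s a rough sketch of what I mean by the standard JumpReLU objective: reconstruction error plus an L0 penalty gated by a learned per-feature threshold. It’s simplified (the straight-through estimators needed in practice to pass gradients to the threshold are omitted), and it doesn’t include the extra penalty from our post.

```python
import torch

def jumprelu(pre_acts: torch.Tensor, log_threshold: torch.Tensor) -> torch.Tensor:
    # Keep a feature only if its pre-activation clears the learned per-feature threshold.
    threshold = log_threshold.exp()
    return pre_acts * (pre_acts > threshold).float()

def jumprelu_sae_loss(recon: torch.Tensor, acts: torch.Tensor,
                      pre_acts: torch.Tensor, log_threshold: torch.Tensor,
                      l0_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an (approximate) L0 penalty: the average number
    # of features whose pre-activation clears the threshold. As written, neither
    # term passes gradients to the threshold; real JumpReLU training handles this
    # with straight-through estimators, omitted here for brevity.
    mse = (recon - acts).pow(2).sum(dim=-1).mean()
    l0 = (pre_acts > log_threshold.exp()).float().sum(dim=-1).mean()
    return mse + l0_coeff * l0
```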
A minor quibble on the ad-hoc point: while I completely agree about the pitfalls of ad-hoc definitions, I don’t think the same arguments apply to ad-hoc training procedures. As long as your evaluation metrics measure the thing you actually care about, ML has a long history of ad-hoc approaches to optimising those metrics that perform surprisingly well. Having said that, though, I agree it would be great to see more research into what’s really going on with these dense features, and for this to lead to a more principled approach to dealing with them! (Whether that turns out to be a better understanding of how to interpret them, or improvements to SAE training that fix them.)