Evan Anders comments on Sparse autoencoders find composed features in small toy models

Evan Anders 29 Apr 2024 16:24 UTC
3 points
2
Hi Lawrence! Thanks so much for this comment and for spelling out (with the math) where and how our thinking and dataset construction were poorly set up. I agree with your analysis and critiques of the first dataset. The biggest problem with that dataset in my eyes (as you point out): the true features in the data are not the ones that I wanted them to be (and claimed them to be), so the SAE isn’t really learning “composed features.”
In retrospect, I wish I had just skipped to the second dataset, which had a result that was (to me) surprising at the time of the post. But there I hadn’t thought about looking at the PCs in hidden space, and didn’t realize those were the diagonals. This makes a lot of sense, and now I understand much better why the SAE recovers those.
My big takeaway from this whole post is: I need to think about all of this a lot more! I’ve struggled a lot to construct a dataset that successfully has some of the interesting characteristics of language model data and also has interesting compositions / correlations. After a month of playing around and reflecting, I don’t think the “two sets of one-hot features” thing we did here is the best way to study this kind of phenomenon.