Hi Demian! Sorry for the really slow response.
Yes! I agree that I was surprised that the decoder weights weren't pointing diagonally in the case where feature occurrences were perfectly correlated. I'm not sure I really grok why this is the case. The models do learn a feature basis that can describe any of the (four) data points that can be passed into the model, but that basis doesn't seem optimal for either the L1 or the MSE term of the loss.
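To make the optimality point concrete, here's a minimal numpy sketch of why a diagonal decoder direction looks strictly better on both loss terms when two features always co-occur. The specific amplitudes and the exact data model are my own stand-ins (the original four data points aren't specified here), so treat this as an illustration of the intuition, not the actual experimental setup:

```python
import numpy as np

# Hypothetical toy data: two features that always co-occur with equal
# amplitude. The amplitude values are invented stand-ins for the
# "(four) data points" in the toy setup.
amps = np.array([0.0, 0.5, 1.0, 1.5])
X = np.stack([amps, amps], axis=1)          # each point is a * (e1 + e2)

# Candidate A: a single unit-norm diagonal decoder atom.
d = np.array([1.0, 1.0]) / np.sqrt(2)
codes_diag = X @ d                          # optimal codes for a unit-norm atom
recon = np.outer(codes_diag, d)
mse_diag = np.mean((X - recon) ** 2)        # zero: the data lies on the diagonal
l1_diag = np.abs(codes_diag).sum()          # sqrt(2) * a per point

# Candidate B: one axis-aligned atom per feature.
codes_axis = X                              # code = amplitude on each axis
mse_axis = np.mean((X - codes_axis) ** 2)   # also zero reconstruction error
l1_axis = np.abs(codes_axis).sum()          # 2 * a per point -- strictly larger
```

Both candidates reconstruct perfectly, but the diagonal atom pays an L1 cost of sqrt(2)·a per point versus 2·a for the axis-aligned pair, which is why an SAE landing on non-diagonal weights here looks like a local minimum rather than the global one.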
And, yeah, I think this is an extremely pathological case. Preliminary results suggest that larger dictionaries, which learn larger sets of features, do a better job of not getting stuck in these weird local minima, and the number of interesting experiments here (varying feature frequency, varying SAE size, varying which features are correlated) makes for a pretty large exploration space.