There should be a neat theoretical reason for the clean power law in where the L1 loss becomes too big. But it doesn't make intuitive sense to me: if you just add some useless entries to the dictionary, the effect on the reconstruction loss of losing one of the dimensions you do use shouldn't change, so why should the point where the L1 loss becomes too big move? Unless you have a bug (or some weird design choice that divides the loss by the number of dimensions), those extra dimensions would have to be changing something.
The L1 loss on the activations does indeed take the mean activation value. I think it's probably a more practical choice than simply taking the sum because it keeps the hyperparameters independent: we wouldn't want the size of the sparsity loss to change wildly relative to the reconstruction loss when we change the dictionary size. I forgot to include the averaging terms in the methods section; I've updated the text in the article. Good spot, thanks!
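For concreteness, here's a minimal sketch of the difference (in PyTorch, with hypothetical names like `sae_loss` and `l1_coef`; this is an illustration of the averaging choice, not the article's actual code):

```python
import torch

def sae_loss(x, x_hat, activations, l1_coef):
    """Hypothetical sketch of an autoencoder loss with a mean-L1 sparsity term.

    x:           input activations, shape (batch, d_input)
    x_hat:       reconstruction,    shape (batch, d_input)
    activations: dictionary codes,  shape (batch, dict_size)
    """
    recon_loss = (x - x_hat).pow(2).mean()  # MSE, averaged over batch and dims
    # Taking .mean() rather than .sum() over the dictionary dimension keeps the
    # sparsity term's scale roughly independent of dict_size, so the same
    # l1_coef weighs the two terms comparably across dictionary sizes.
    sparsity_loss = activations.abs().mean()
    return recon_loss + l1_coef * sparsity_loss
```

With the sum instead of the mean, doubling the dictionary size would roughly double the sparsity term, and `l1_coef` would have to be retuned for every dictionary size.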
I’d definitely be interested in you including this as a variable in the toy data, and seeing how it affects the hyperparameter search heuristics.
Yeah, I think this is probably worth checking too. We wouldn't need too many different values to get a rough sense of its effect.
Fig. 9 is cursed. Is there a problem with estimating from just one component of the loss?
Yeah it kind of is… It’s probably better to just look at each loss component separately. Very helpful feedback, thanks!
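For instance, a hypothetical variant of the sketch above that exposes each term, so reconstruction and sparsity curves can be plotted on their own rather than read off the combined loss:

```python
import torch

def sae_loss_components(x, x_hat, activations, l1_coef):
    """Hypothetical loss variant that returns each component separately."""
    recon_loss = (x - x_hat).pow(2).mean()        # reconstruction term
    sparsity_loss = activations.abs().mean()      # mean-L1 sparsity term
    total = recon_loss + l1_coef * sparsity_loss  # what gets backpropagated
    # Tracking the two terms separately avoids estimating anything from the
    # combined loss, which mixes the effects of both hyperparameters.
    return total, recon_loss.detach(), sparsity_loss.detach()
```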