Note: My take does not necessarily represent the takes of my coauthors (Hoagy, Logan, Lee, Robert). Or it might, but they may frame it differently. Take this as strictly my take.
My take is that the goal isn’t strictly to get maximum expressive power under the assumptions detailed in Toy Models of Superposition; for instance, Anthropic found that FISTA-based dictionaries didn’t work as well as sparse autoencoders, even though they are more expressive in the sense that they can achieve lower reconstruction loss at the same level of sparsity. We might find that the sparsity-monosemanticity link breaks down at higher levels of autoencoder expressivity, although this needs to be rigorously tested.
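For concreteness, the reconstruction-vs-sparsity trade-off I'm referring to is the standard SAE objective: reconstruction error plus an L1 penalty on the feature activations. This is a minimal sketch in PyTorch, not the exact code from any of the papers; the name `l1_coeff` and the default value are just illustrative.

```python
import torch.nn.functional as F

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Sketch of the usual SAE objective.

    x:     original activations             (batch, d_model)
    x_hat: autoencoder reconstruction       (batch, d_model)
    f:     non-negative feature activations (batch, d_dict)
    """
    recon = F.mse_loss(x_hat, x)              # how well the dictionary explains x
    sparsity = f.abs().sum(dim=-1).mean()     # L1 proxy for "number of active features"
    return recon + l1_coeff * sparsity
```

A method like FISTA can push this frontier further (lower reconstruction error at the same sparsity level), which is exactly why it's notable that the resulting dictionaries were reportedly less monosemantic.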
To answer your question: I think Hoagy thinks that tied weights are more similar to how an MLP might use features during a forward pass, which would involve extracting the feature through a simple dot product. I’m not sure I buy this, as having untied weights is equivalent to allowing the model to express simple linear computations like ‘feature A activation = dot product along feature A direction minus dot product along feature B direction’, which could be a form of denoising if A and B were mutually exclusive but non-orthogonal features.
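To make the tied/untied distinction concrete, here is a minimal sketch (my own naming and initialization, not our actual training code). With tied weights, each feature activation is just a ReLU of the dot product with that feature's dictionary direction; with untied weights, the encoder row for feature A can be any linear read-out, e.g. "direction A minus direction B", which is the kind of denoising computation I mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Sketch of a sparse autoencoder over MLP activations.

    tied=True:  features are read out by a plain dot product with the
                same dictionary directions used for reconstruction.
    tied=False: the encoder learns separate read-out directions, so it
                can express simple linear computations across features.
    """
    def __init__(self, d_model: int, d_dict: int, tied: bool = True):
        super().__init__()
        self.tied = tied
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)  # dictionary directions
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        if not tied:
            # separate encoder matrix: a row can be e.g. (direction A - direction B)
            self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)

    def encode(self, x):
        x = x - self.b_dec
        if self.tied:
            pre = x @ self.W_dec.T + self.b_enc  # dot product with each dictionary direction
        else:
            pre = x @ self.W_enc + self.b_enc    # arbitrary linear read-out per feature
        return F.relu(pre)

    def forward(self, x):
        f = self.encode(x)                       # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec      # linear reconstruction from the dictionary
        return x_hat, f
```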