I’d definitely be interested if you have any takes on tied vs. untied weights.
It seems like the goal is to get the maximum expressive power for the autoencoder, while still learning features that are linear. So are untied weights good because they’re more expressive, or are they bad because they encourage the model to start exploiting nonlinearity?
Note: My take does not necessarily represent the takes of my coauthors (Hoagy, Logan, Lee, Robert). Or it might, but they may frame it differently. Take this as strictly my take.
My take is that the goal isn’t strictly to get maximum expressive power under the assumptions detailed in Toy Models of Superposition; for instance, Anthropic found that FISTA-based dictionaries didn’t work as well as sparse autoencoders, even though they are more expressive in that they can achieve lower reconstruction loss at the same level of sparsity. We might find that the sparsity-monosemanticity link breaks down at higher levels of autoencoder expressivity, although this needs to be rigorously tested.
To answer your question: I think Hoagy thinks that tied weights are more similar to how an MLP might use features during a forward pass, which would involve extracting the feature through a simple dot-product. I’m not sure I buy this, as having untied weights is equivalent to allowing the model to express simple linear computations like ‘feature A activation = (dot product along feature A direction) − (dot product along feature B direction)’, which could be a form of denoising if A and B were mutually exclusive but non-orthogonal features.
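To make that concrete, here’s a minimal sketch of the denoising computation an untied encoder row could express (the specific directions and activations are made up for illustration):

```python
import numpy as np

# Two mutually exclusive but non-orthogonal feature directions (illustrative).
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])

# Input where only feature B is active.
x_b = 2.0 * b

# Tied-style readout for feature A: a plain dot product along a.
# Because a and b are non-orthogonal, B's activation bleeds into A's readout.
print(x_b @ a)            # ≈ 1.2 — spurious A activation from interference

# An untied encoder row for feature A could instead be (a - b), implementing
# 'dot product along A minus dot product along B', which suppresses B:
print(x_b @ a - x_b @ b)  # negative, so the ReLU would clamp it to 0
```

Under this framing, untying the weights lets each encoder row do a small amount of linear interference-cancellation before the ReLU, rather than a pure projection onto the feature direction.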
Good question! I started writing and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also let me give the caveat that I wasn’t on the training side of the project so these are less informed than Hoagy, Logan, and Aidan’s views:
+1 to Aidan’s answer.
I wish we could resolve tied vs untied purely via “use whichever makes things more interpretable by metric X”, but right now I don’t think our interpretability metrics are fine-grained and reliable enough to make that decision for us yet.
I expect a lot of future work will ask these architectural questions about the autoencoder, and, like transformers in general, will settle on some guidelines for what works best.
Tied weights are expressive enough to pass the test of “if you squint and ignore the nonlinearity, they should still work”. In particular, (ignoring bias terms) we’re trying to make WᵀReLU(Wx) = x, so we need (squinting past the ReLU) WᵀW = I, and many matrices satisfy WᵀW = I.
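For reference, a minimal tied-weight autoencoder forward pass looks like this (shapes, names, and the random dictionary are illustrative, not the actual training code; bias terms are omitted as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 4, 8  # illustrative sizes; dictionary is overcomplete
W = rng.normal(size=(d_hidden, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm dictionary rows

def tied_sae(x):
    """Tied-weight sparse autoencoder: decoder reuses the encoder weights."""
    f = np.maximum(0.0, W @ x)  # encoder: ReLU(Wx) gives feature activations
    return W.T @ f              # decoder: Wᵀf, the transpose of the encoder

x = rng.normal(size=d_model)
x_hat = tied_sae(x)  # reconstruction, same shape as x
```

Untying the weights just means replacing `W.T` in the decoder with an independently learned matrix of the same shape.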
Tied weights certainly make it easier to explain the autoencoder: “this vector was very far in the X direction, so in its reconstruction we add back in a term along the X direction” vs adding back a vector in a (potentially different) Y direction.
Downstream of this, tied weights make ablations make more sense to me. Let’s say you have some input A that activates direction X at a score of 5, so the autoencoder’s reconstruction is A ≈ 5X + [other stuff]. In the ablation, we replace A with A − 5X, and if you feed A − 5X into the sparse autoencoder, the X direction will activate at 0, so the reconstruction will be A − 5X ≈ 0X + [different other stuff due to interference]. Therefore the only difference in the accuracy of your reconstruction will be how much the other feature activations are changed by interference. But if your reconstructions use the Y vector instead, then when you feed in A − 5X, you’ll replace A ≈ 5Y + [other stuff] with A − 5X ≈ 0Y + [different other stuff], so you’ve also changed things by 5X − 5Y.
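The ablation arithmetic above can be checked numerically. A hedged sketch (the directions X and Y, the third “other stuff” axis, and the activation score of 5 are all illustrative choices, not values from the paper):

```python
import numpy as np

# Two non-orthogonal feature directions in a 3-d space (illustrative).
X = np.array([1.0, 0.0, 0.0])
Y = np.array([0.8, 0.6, 0.0])

# Input A activates direction X with score 5, plus some "other stuff".
A = 5 * X + np.array([0.0, 0.0, 1.0])

# Tied case: read out X's activation by dot product, then ablate it.
score = A @ X          # = 5
A_ablated = A - score * X
print(A_ablated @ X)   # 0.0 — X's activation is gone; only interference remains

# Untied case: the reconstruction adds back 5Y rather than 5X, so ablating
# along X leaves an extra residual of 5X - 5Y in the reconstruction error.
residual = 5 * X - 5 * Y
print(np.linalg.norm(residual))  # nonzero whenever X != Y
```

The point is that with tied weights the ablation only perturbs the other features via interference, while with untied weights you additionally pick up the 5X − 5Y mismatch between the encoder and decoder directions.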
If we’re abandoning the tied weights and just want to decompose the layer into any sparse code, why not just make the sparse autoencoder deeper, throw in smooth activations instead of ReLU, etc? That’s not rhetorical, I honestly don’t know… probably you’d still want ReLU at the end to clamp your activations to be positive. Probably you don’t need too much nonlinearity because the model itself “reads out” of the residual stream via linear operations. I think the thing to try here is trying to make the sparse autoencoder architecture as similar to the language model architecture as possible, so that you can find the “real” “computational factors”.