Robert_AIZI comments on Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI 9 Oct 2023 13:18 UTC
5 points
Good question! I started writing and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also let me give the caveat that I wasn’t on the training side of the project so these are less informed than Hoagy, Logan, and Aidan’s views:
- +1 to Aidan’s answer.
- I wish we could resolve tied vs untied purely via “use whichever makes things more interpretable by metric X”, but I don’t think our interpretability metrics are fine-grained and reliable enough yet to make that decision for us.
- I expect a lot of future work will ask these questions about the autoencoder architecture and, as happened with transformers in general, will settle on some guidelines for what works best.
- Tied weights are expressive enough to pass the test of “if you squint and ignore the nonlinearity, they should still work”. In particular, (ignoring bias terms) we’re trying to make $W^{T}\,\mathrm{ReLU}(W x) = x$, so we need “$W^{T}\,\mathrm{ReLU}\,W = I$”, and many matrices satisfy $W^{T} W = I$.
- Tied weights certainly make it easier to explain the autoencoder—“this vector was very far in the X direction, so in its reconstruction we add back in a term along the X direction” vs adding back a vector in a (potentially different) Y direction.
- Downstream of this, tied weights make ablations make more sense to me. Say you have some input A that activates direction X with a score of 5, so the autoencoder’s reconstruction is A ≈ 5X + [other stuff]. In the ablation, we replace A with A-5X. If you feed A-5X into the sparse autoencoder, the X direction will have activation 0, so the reconstruction will be A-5X ≈ 0X + [different other stuff due to interference]. The only change in reconstruction accuracy then comes from how much interference changes the other feature activations. But if your reconstructions use the Y vector instead, then feeding in A-5X replaces A ≈ 5Y + [other stuff] with A-5X ≈ 0Y + [different other stuff], so you’ve also changed things by 5X-5Y.
- If we’re abandoning the tied weights and just want to decompose the layer into any sparse code, why not just make the sparse autoencoder deeper, throw in smooth activations instead of ReLU, etc? That’s not rhetorical, I honestly don’t know… probably you’d still want ReLU at the end to clamp your activations to be positive. Probably you don’t need too much nonlinearity because the model itself “reads out” of the residual stream via linear operations. I think the thing to try here is to make the sparse autoencoder architecture as similar to the language model architecture as possible, so that you can find the “real” “computational factors”.
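
To make the “squint” test and the ablation argument above concrete, here is a minimal NumPy sketch. It is a toy, not the project's training code: the sizes, the random seed, the names `enc`, `dec_tied`, `dec_untied`, `reconstruct`, and `ablation_mismatch`, and the particular choice of Y are all assumptions made for illustration. It also assumes unit-norm, mutually orthogonal encoder directions so that the “interference” terms vanish and only the 5X-5Y effect is left.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def reconstruct(x, enc, dec):
    """Bias-free sparse autoencoder (hypothetical helper): dec @ ReLU(enc @ x)."""
    return dec @ relu(enc @ x)

rng = np.random.default_rng(0)

# Squint test: an overcomplete W (8 features, 4 dims) with orthonormal columns
# satisfies W.T @ W = I, so the tied map is the identity once ReLU is ignored.
W, _ = np.linalg.qr(rng.normal(size=(8, 4)))  # reduced QR, W has shape (8, 4)
squint_error = np.abs(W.T @ W - np.eye(4)).max()

# Ablation: rows of enc are orthonormal feature directions (assumption),
# so ablating one feature does not disturb the other activations.
enc, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X, X1 = enc[0], enc[1]
A = 5.0 * X + 2.0 * X1        # input that activates direction X at a score of 5
A_ablated = A - 5.0 * X       # the ablated input A - 5X

# Tied decoder writes each feature back along its own encoder direction.
dec_tied = enc.T

# Untied decoder: feature 0 is written back along a different unit vector Y.
Y = (X + 0.5 * X1) / np.linalg.norm(X + 0.5 * X1)
dec_untied = enc.T.copy()
dec_untied[:, 0] = Y

def ablation_mismatch(dec):
    """(change in reconstruction) minus (change in input) under the ablation."""
    recon_change = reconstruct(A_ablated, enc, dec) - reconstruct(A, enc, dec)
    return recon_change - (A_ablated - A)

tied_mismatch = ablation_mismatch(dec_tied)
untied_mismatch = ablation_mismatch(dec_untied)

print("max |W^T W - I|:", squint_error)
print("tied mismatch:", tied_mismatch)
print("untied mismatch:", untied_mismatch)
print("5X - 5Y:", 5.0 * (X - Y))
```

In this toy the tied mismatch is zero up to floating-point error, while the untied mismatch equals 5X-5Y, matching the argument in the ablation bullet. With non-orthogonal directions, both cases would pick up extra interference terms on top of this.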