Lee Sharkey comments on Sparsify: A mechanistic interpretability research agenda

Lee Sharkey 5 Apr 2024 14:33 UTC
2 points
Thanks Aidan!
I’m not sure I follow this bit:
“In my mind, the reconstruction loss is more of a non-degeneracy control to encourage almost-orthogonality between features.”
I don’t currently see why reconstruction would encourage features to point in different directions from each other in any way, unless it is paired with an L_p penalty with 0 < p < 1. And I specifically don’t mean L1, because in toy data settings with reconstruction + L1, you can end up with features pointing in exactly the same direction.
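To make the L1 point concrete, here is a minimal numerical sketch of one way it can happen (my own toy construction, not taken from any particular experiment). It assumes a linear decoder and non-negative activations; the names `sparse_penalty` and `reconstruct` are hypothetical helpers. Splitting one feature's activation across two dictionary elements that share a direction leaves both the reconstruction and the L1 penalty unchanged, whereas an L_p penalty with p = 0.5 gets strictly larger.

```python
import numpy as np

def sparse_penalty(acts, p):
    """Sum of |a_i|^p over feature activations (p = 1 gives the L1 penalty)."""
    return float(np.sum(np.abs(acts) ** p))

def reconstruct(acts, decoder):
    """Linear decoder: each row of `decoder` is one feature direction."""
    return acts @ decoder

# A single unit-norm feature direction (illustrative values).
direction = np.array([0.6, 0.8])

# Dictionary A: one feature with activation 2.
dec_single = direction[None, :]
acts_single = np.array([2.0])

# Dictionary B: two features pointing in exactly the same direction,
# with the same total activation split evenly between them.
dec_dup = np.stack([direction, direction])
acts_dup = np.array([1.0, 1.0])

recon_single = reconstruct(acts_single, dec_single)
recon_dup = reconstruct(acts_dup, dec_dup)

l1_single = sparse_penalty(acts_single, 1.0)
l1_dup = sparse_penalty(acts_dup, 1.0)
lhalf_single = sparse_penalty(acts_single, 0.5)
lhalf_dup = sparse_penalty(acts_dup, 0.5)

print("same reconstruction:", np.allclose(recon_single, recon_dup))
print("L1 penalty   single vs duplicated:", l1_single, l1_dup)
print("L0.5 penalty single vs duplicated:", lhalf_single, lhalf_dup)
```

So under reconstruction + L1 the duplicated dictionary is no worse than the single-feature one, and nothing in the loss pushes the two copies apart. With p < 1 the split costs more, which is the sense in which the sparsity term, rather than the reconstruction term, would be doing the work of discouraging identical directions.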