Rosco-Hunter comments on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Rosco-Hunter 25 Apr 2024 12:28 UTC
1 point
0
This was a really interesting paper, but it left me with one question: why exactly is the model motivated to learn anything more complex than the identity map? An autoencoder whose latent space is much smaller than its input is forced to learn an interesting map, but I can’t see why a highly over-parameterised autoencoder wouldn’t simply learn something close to the identity. Is it somehow the regularisation or the bias terms? I’d love to hear an argument for why the autoencoder is likely to learn these monosemantic features rather than an identity map.
- Zac Hatfield-Dodds 25 Apr 2024 15:35 UTC
  2 points
  0
  It’s a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
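
To make the point in the reply concrete, here is a minimal toy sketch in NumPy. It is not the paper's training code. The function `sae_loss`, the two hand-built weight sets, and the toy data are all hypothetical, and the sketch assumes a ReLU hidden layer with decoder directions held at unit norm (so the L1 term cannot be reduced just by rescaling). The data are positive multiples of two diagonal directions. An identity-like map and a feature-aligned dictionary both reconstruct the data, but the identity-like map pays a larger L1 penalty.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on hidden activations."""
    f = np.maximum(0.0, x @ W_enc + b_enc)          # ReLU hidden activations
    x_hat = f @ W_dec + b_dec                       # linear decoder
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))        # sparsity term
    return recon + l1_coeff * l1, recon, l1

# Toy data (assumption): each input is a positive multiple of one of two
# "true" feature directions, neither of which is axis-aligned.
s = 1.0 / np.sqrt(2.0)
d1, d2 = np.array([s, s]), np.array([s, -s])
X = np.stack([1.0 * d1, 2.0 * d2, 0.5 * d1, 3.0 * d2])

zeros_h, zeros_x = np.zeros(4), np.zeros(2)

# Identity-like solution: hidden units copy +x1, +x2, -x1, -x2 through the ReLU.
W_dec_id = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
W_enc_id = W_dec_id.T

# Feature-aligned solution: hidden units read off +d1, +d2, -d1, -d2.
W_dec_feat = np.stack([d1, d2, -d1, -d2])
W_enc_feat = W_dec_feat.T

l1_coeff = 0.1  # hypothetical penalty weight
loss_id, recon_id, l1_id = sae_loss(X, W_enc_id, zeros_h, W_dec_id, zeros_x, l1_coeff)
loss_feat, recon_feat, l1_feat = sae_loss(X, W_enc_feat, zeros_h, W_dec_feat, zeros_x, l1_coeff)

print("identity-like:   recon =", recon_id, " L1 =", l1_id, " loss =", loss_id)
print("feature-aligned: recon =", recon_feat, " L1 =", l1_feat, " loss =", loss_feat)
```

With `l1_coeff = 0` the two solutions are tied up to rounding, since both reconstruct the inputs, so nothing pushes the autoencoder away from the identity-like map. With a positive `l1_coeff`, the feature-aligned dictionary has the lower loss on this toy data, because each input activates one hidden unit instead of two.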