Great! I’m curious, what was it about the sparsity penalty that you changed your mind about?
I previously thought that L1 penalties were exactly what you wanted for sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and your loss is the squared error of your guess plus the norm of your guess. Then for any guess x > 0 the loss is (x − 2)² + x, which is minimized at x = 3/2, not 2.
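The toy example can be checked numerically; this is just a sketch of that calculation, with the target value and unit L1 weight taken from the example above:

```python
# Toy single-feature SAE: reconstruct the target value 2 under the loss
# (x - 2)^2 + |x|  (squared reconstruction error plus an L1 penalty).
def loss(x, target=2.0, l1_weight=1.0):
    return (x - target) ** 2 + l1_weight * abs(x)

# Grid search over candidate guesses in [-1, 3] with step 0.001.
# The minimizer is 1.5, not 2: the L1 penalty shrinks the optimal
# guess below the target, i.e. systematic undershooting.
best = min((i / 1000 for i in range(-1000, 3001)), key=loss)
print(best)  # 1.5
```

This matches the closed form: for x > 0 the derivative 2(x − 2) + 1 vanishes at x = 3/2.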
Makes sense! Thanks!