Arthur Conmy comments on [Interim research report] Taking features out of superposition with sparse autoencoders

Arthur Conmy 3 Jan 2024 16:43 UTC
LW: 3 AF: 2
I previously thought that L1 penalties were exactly what you wanted for sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and your loss is the squared error of your guess plus the L1 norm (absolute value) of your guess. For a guess x > 0 the loss is (x - 2)^2 + x, whose derivative 2(x - 2) + 1 vanishes at x = 3/2, so the loss is minimized at x = 3/2, not 2.
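A minimal sketch that checks this toy example numerically, assuming the penalty is the absolute value of the guess with coefficient 1. The names `toy_loss` and `argmin_on_grid` are illustrative and do not come from the post or from any SAE codebase.

```python
def toy_loss(x, target=2.0, l1_coeff=1.0):
    """Squared reconstruction error plus an L1 penalty on the guess.

    Assumption: the penalty coefficient is 1, as in the toy example.
    """
    return (x - target) ** 2 + l1_coeff * abs(x)


def argmin_on_grid(loss, lo=0.0, hi=4.0, steps=4001):
    """Return the grid point in [lo, hi] with the smallest loss."""
    xs = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(xs, key=loss)


# Brute-force minimizer over a grid with spacing 0.001.
best_x = argmin_on_grid(toy_loss)

# Closed form for a positive minimizer: set 2(x - target) + l1_coeff = 0.
closed_form_x = 2.0 - 1.0 / 2.0

print(best_x, closed_form_x, toy_loss(best_x), toy_loss(2.0))
```

Under these assumptions the exact reconstruction x = 2 has loss 2, while the undershooting guess x = 3/2 has loss 1.75, so the L1 term pulls the optimum below the true value.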
- Lee Sharkey 3 Jan 2024 21:50 UTC
  LW: 1 AF: 1
  Makes sense! Thanks!