Lee Sharkey comments on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Lee Sharkey 28 Feb 2024 14:45 UTC
7 points
0
This is a good idea and is something we’re (Apollo + MATS stream) working on atm. We’re planning on releasing our agenda related to this and, of course, results whenever they’re ready to share.
- Joseph Bloom 28 Feb 2024 16:56 UTC
  2 points
  0
  I’ve heard this idea floated a few times and am a little worried that “When a measure becomes a target, it ceases to be a good measure” will apply here. OTOH, you can directly check whether the MSE / variance explained diverges significantly, so at least you can track the resulting SAE’s usefulness for decomposition. I’d be pretty surprised if an SAE trained with this objective became vastly more performant, and you could check whether downstream activations computed from the reconstructed activations were off distribution. So overall, I’m pretty excited to see what you get!
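For concreteness, here is a minimal sketch of the MSE / variance-explained check mentioned above. It assumes you already have the original activations and the SAE's reconstructions as arrays of shape `(n_samples, d_model)`. The function name `reconstruction_metrics` and the synthetic data are hypothetical stand-ins, not anything from the post or from a particular SAE library.

```python
import numpy as np

def reconstruction_metrics(acts, recon):
    """Return (mse, fraction_of_variance_explained) for a reconstruction.

    acts, recon: arrays of shape (n_samples, d_model).
    """
    acts = np.asarray(acts, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    # Mean squared reconstruction error, averaged over all elements.
    mse = np.mean((acts - recon) ** 2)
    # Variance of the original activations around their per-dimension mean.
    total_var = np.mean((acts - acts.mean(axis=0)) ** 2)
    # 1.0 means perfect reconstruction; 0.0 is no better than predicting the mean.
    frac_var_explained = 1.0 - mse / total_var
    return mse, frac_var_explained

# Synthetic stand-in data (an assumption for illustration, not real model activations).
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 64))
recon = acts + 0.1 * rng.normal(size=acts.shape)  # pretend SAE output

mse, fve = reconstruction_metrics(acts, recon)
print(f"MSE: {mse:.4f}, variance explained: {fve:.4f}")
```

Tracking these two numbers alongside whatever downstream objective the SAE is trained on would show whether the reconstruction quality drifts as the new objective improves.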