I feel like my post comes across as overly dramatic; I'm not actually very surprised, and I don't consider this the strongest evidence against SAEs. It's an experiment I ran a while ago, and it hasn't changed my (somewhat SAE-sceptic) stance much.
But this comes after having seen a bunch of other weird SAE behaviours (pre-activation distributions aren't what you'd expect under the superposition hypothesis, h/t @jake_mendel; feeding SAE-reconstructed activations back into the encoder makes the SAE go nuts; the issues mentioned in recent Apollo papers; …).
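For concreteness, the re-encoding check is something like the sketch below; the toy ReLU SAE with random weights is just a placeholder (in practice you'd load a trained SAE), so it illustrates the check rather than the result.

```python
# Illustrative sketch only: a toy ReLU SAE with random weights stands in for a trained one.
# The check: encode x, decode to x_hat, re-encode x_hat, and compare the two feature vectors.
# For a "well-behaved" SAE you'd hope the features of x and of its reconstruction agree.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 768, 16_384                     # placeholder dims, matching a 16k-feature SAE
W_enc = rng.standard_normal((d, n_features)) / np.sqrt(d)
W_dec = rng.standard_normal((n_features, d)) / np.sqrt(n_features)
b_enc = rng.standard_normal(n_features) * 0.1

def encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations

def decode(f):
    return f @ W_dec                            # linear reconstruction

x = rng.standard_normal(d)                      # stand-in for a model activation
f = encode(x)
f_reencoded = encode(decode(f))                 # features of the SAE's own reconstruction

active = np.flatnonzero(f)
active_reencoded = np.flatnonzero(f_reencoded)
shared = np.intersect1d(active, active_reencoded).size
print(f"active: {active.size}, re-encoded active: {active_reencoded.size}, shared: {shared}")
```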
Reasons this could be less concerning than it looks
Activation reconstruction isn’t that important: clustering is a strong optimiser, and if you fill a space with 16k clusters, maybe 90% reconstruction isn’t that surprising. I should really run a random Gaussian data baseline for this.
End-to-end loss is more important, and maybe SAEs look much better when you measure reconstruction by its effect on the model’s downstream loss (rough sketch of what I mean below, after this list).
This isn’t the only evidence in favour of SAEs; they also kinda work for steering/probing (though pretty badly).
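On the end-to-end point: what I have in mind is splicing the SAE reconstruction back into the forward pass and measuring how much the language-modelling loss degrades. A minimal sketch using TransformerLens, with a placeholder identity "SAE" standing in for a trained encoder/decoder and an arbitrary choice of layer/hook point:

```python
# Sketch of an "end-to-end" check: patch the SAE reconstruction of one residual-stream
# activation back into the model and compare the LM loss against the clean forward pass.
# sae_encode / sae_decode are placeholder identities; substitute a trained SAE.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
hook_name = "blocks.6.hook_resid_post"          # arbitrary layer / hook point

def sae_encode(acts):
    return acts                                 # placeholder: replace with a trained SAE encoder

def sae_decode(feats):
    return feats                                # placeholder: replace with a trained SAE decoder

def patch_in_reconstruction(acts, hook):
    return sae_decode(sae_encode(acts))         # swap activations for their reconstruction

clean_loss = model(tokens, return_type="loss")
patched_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(hook_name, patch_in_reconstruction)],
)
print(f"clean loss: {clean_loss.item():.4f}  with reconstruction patched in: {patched_loss.item():.4f}")
```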
I should really run a random Gaussian data baseline for this.
Tentatively I get similar results (70-85% variance explained) for random data—I haven’t checked that code at all though, don’t trust this. Will double check this tomorrow.
(In that case the SAEs’ performance would also be unsurprising, I suppose.)
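Roughly, the random Gaussian baseline would look like the sketch below; the dimension, sample count, and cluster count are placeholder choices, not the exact settings from my experiment.

```python
# Rough sketch of the random-Gaussian baseline: fit k-means with many clusters to isotropic
# Gaussian data and see how much variance the nearest-centroid "reconstruction" explains.
# Dimension, sample count, and cluster count are placeholder assumptions, not the original settings.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

d = 768                  # assumed activation dimension
n_samples = 100_000
n_clusters = 16_384      # "16k clusters"; shrink this if it's too slow

rng = np.random.default_rng(0)
X = rng.standard_normal((n_samples, d)).astype(np.float32)

km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=8192, n_init=3, random_state=0)
labels = km.fit_predict(X)
X_hat = km.cluster_centers_[labels]             # each point replaced by its nearest centroid

# Fraction of variance unexplained vs. explained, analogous to the SAE reconstruction metric.
fvu = ((X - X_hat) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()
print("variance explained:", 1.0 - fvu)
```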
Is there a benchmark in which SAEs clearly, definitely outperform standard techniques?