I feel like my post comes across as overly dramatic; I'm not actually very surprised, and I don't consider this the strongest evidence against SAEs. It's an experiment I ran a while ago, and it hasn't changed my (somewhat SAE-sceptic) stance much.
But this comes after having seen a bunch of other weird SAE behaviours (pre-activation distributions aren't what you'd expect under the superposition hypothesis, h/t @jake_mendel; feeding SAE-reconstructed activations back into the encoder makes the SAE go nuts; the issues mentioned in recent Apollo papers; …).
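For concreteness, the re-encoding check is something like the sketch below; the toy ReLU SAE with random weights is just a placeholder (in practice you'd load a trained SAE), so it illustrates the check rather than the result.

```python
# Illustrative sketch only: a toy ReLU SAE with random weights stands in for a trained one.
# The check: encode x, decode to x_hat, re-encode x_hat, and compare the two feature vectors.
# For a "well-behaved" SAE you'd hope the features of x and of its reconstruction agree.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 768, 16_384                     # placeholder dims, matching a 16k-feature SAE
W_enc = rng.standard_normal((d, n_features)) / np.sqrt(d)
W_dec = rng.standard_normal((n_features, d)) / np.sqrt(n_features)
b_enc = rng.standard_normal(n_features) * 0.1

def encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations

def decode(f):
    return f @ W_dec                            # linear reconstruction

x = rng.standard_normal(d)                      # stand-in for a model activation
f = encode(x)
f_reencoded = encode(decode(f))                 # features of the SAE's own reconstruction

active = np.flatnonzero(f)
active_reencoded = np.flatnonzero(f_reencoded)
shared = np.intersect1d(active, active_reencoded).size
print(f"active: {active.size}, re-encoded active: {active_reencoded.size}, shared: {shared}")
```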
Reasons this could be less concerning than it looks
Activation reconstruction isn’t that important: clustering is a strong optimiser, and if you fill a space with 16k clusters, maybe 90% reconstruction isn’t that surprising. I should really run a random Gaussian data baseline for this.
End-to-end loss is more important, and maybe SAEs look much better when you measure reconstruction by its effect on the model’s downstream loss (rough sketch of what I mean below, after this list).
This isn’t the only evidence in favour of SAEs; they also kinda work for steering/probing (though pretty badly).
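On the end-to-end point: what I have in mind is splicing the SAE reconstruction back into the forward pass and measuring how much the language-modelling loss degrades. A minimal sketch using TransformerLens, with a placeholder identity "SAE" standing in for a trained encoder/decoder and an arbitrary choice of layer/hook point:

```python
# Sketch of an "end-to-end" check: patch the SAE reconstruction of one residual-stream
# activation back into the model and compare the LM loss against the clean forward pass.
# sae_encode / sae_decode are placeholder identities; substitute a trained SAE.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
hook_name = "blocks.6.hook_resid_post"          # arbitrary layer / hook point

def sae_encode(acts):
    return acts                                 # placeholder: replace with a trained SAE encoder

def sae_decode(feats):
    return feats                                # placeholder: replace with a trained SAE decoder

def patch_in_reconstruction(acts, hook):
    return sae_decode(sae_encode(acts))         # swap activations for their reconstruction

clean_loss = model(tokens, return_type="loss")
patched_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(hook_name, patch_in_reconstruction)],
)
print(f"clean loss: {clean_loss.item():.4f}  with reconstruction patched in: {patched_loss.item():.4f}")
```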
I should really run a random Gaussian data baseline for this.
Tentatively I get similar results (70-85% variance explained) for random data—I haven’t checked that code at all though, don’t trust this. Will double check this tomorrow.
(In that case the SAEs’ performance would also be unsurprising, I suppose.)
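Roughly, the random Gaussian baseline would look like the sketch below; the dimension, sample count, and cluster count are placeholder choices, not the exact settings from my experiment.

```python
# Rough sketch of the random-Gaussian baseline: fit k-means with many clusters to isotropic
# Gaussian data and see how much variance the nearest-centroid "reconstruction" explains.
# Dimension, sample count, and cluster count are placeholder assumptions, not the original settings.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

d = 768                  # assumed activation dimension
n_samples = 100_000
n_clusters = 16_384      # "16k clusters"; shrink this if it's too slow

rng = np.random.default_rng(0)
X = rng.standard_normal((n_samples, d)).astype(np.float32)

km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=8192, n_init=3, random_state=0)
labels = km.fit_predict(X)
X_hat = km.cluster_centers_[labels]             # each point replaced by its nearest centroid

# Fraction of variance unexplained vs. explained, analogous to the SAE reconstruction metric.
fvu = ((X - X_hat) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()
print("variance explained:", 1.0 - fvu)
```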
Is there a benchmark in which SAEs clearly, definitely outperform standard techniques?