I’m going to update the results in the top-level comment with the corrected data; I’m pasting the original figures here for posterity and for context on the earlier discussion. Summary of changes:
[Minor] I didn’t subtract the mean in the variance calculation (see the sketch after this list). This barely had an effect on the results.
[Major] I used a different definition of “Explained Variance”, which caused a pretty large difference.
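For reference, here’s a minimal NumPy sketch of the mean-subtraction point; the function names are mine, and the exact “Explained Variance” definitions used in my original/corrected runs and by SAEBench/Neuronpedia may differ from this — treat it as an illustration, not the code behind the numbers:

```python
import numpy as np

def explained_variance_no_mean(acts: np.ndarray, recon: np.ndarray) -> float:
    # Variant without mean subtraction (the [Minor] issue): the denominator is the
    # raw second moment of the activations rather than their variance.
    residual = acts - recon
    return 1.0 - (residual ** 2).sum() / (acts ** 2).sum()

def explained_variance(acts: np.ndarray, recon: np.ndarray) -> float:
    # Mean-subtracted version: a standard fraction-of-variance-explained (R^2-style),
    # computed over all tokens and dimensions at once.
    residual = acts - recon
    total_variance = ((acts - acts.mean(axis=0)) ** 2).sum()
    return 1.0 - (residual ** 2).sum() / total_variance

# Toy usage: acts is (n_tokens, d_model) activations, recon is a reconstruction of them.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 768))
recon = acts + 0.3 * rng.normal(size=acts.shape)
print(explained_variance_no_mean(acts, recon), explained_variance(acts, recon))
```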
Old (no longer true) text:
It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0>100. This isn’t an entirely fair comparison, since SAEs are optimised for the large-L_0 regime, while I haven’t found an L_0>1 operationalisation of clustering that meaningfully improves over L_0=1. To have some comparison I’m adding a PCA + Clustering baseline where I apply a PCA before doing the clustering. It does roughly as well as expected, exceeding the SAE reconstruction for most L_0 values. The upcoming SAEBench paper also does a PCA baseline, so I won’t discuss PCA in detail here.
[...]
Here’s the code used to get the clustering & PCA numbers below; the SAE numbers are taken straight from Neuronpedia. Both my code and SAEBench/Neuronpedia use OpenWebText with a 128-token context length, so I hope the numbers are comparable, but there’s a risk I missed something and we’re comparing apples to oranges.