Good points, thanks! We do try training with the same setup as the paper in the “Investigating Further” section, which still doesn’t work. I agree that riskiness might just be a bad setup here; I’m planning to try some other more interesting/complex awareness behaviors next and will try backdoors with those too.
Josh Engels’s Shortform
I really liked @Sam Marks’ recent post on downstream applications as validation for interp techniques, and I’ve been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I’ve written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these phenomena better, that’s great! Otherwise, I hope that seeing where our current techniques fail might help us develop better ones.
I’m also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren’t as useful as linear probing, or even careful black box experiments, that seems important to know!
Here’s the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0
Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.
Takeaways From Our Recent Work on SAE Probing
I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly.
https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309
I think the correct way to compute FVU is to subtract the per-dimension mean when computing the denominator. See the SAEBench implementation here:
https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
When I used your FVU implementation, I got 72% variance explained; this is still less than your number, but much closer, so I think this difference might account for the apparent improvement over the SAEBench numbers.
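For concreteness, here’s a minimal sketch of the two denominator conventions (function and tensor names are mine, not from either codebase):

```python
import torch

def fvu(acts: torch.Tensor, recons: torch.Tensor, subtract_mean: bool = True) -> float:
    """Fraction of variance unexplained for activations of shape (n_tokens, d_model).

    With subtract_mean=True, the denominator is the variance around the per-dimension
    mean (the SAEBench convention); with False, it is the raw second moment, which
    inflates the denominator and makes FVU look smaller.
    """
    residual = (acts - recons).pow(2).sum()
    if subtract_mean:
        total = (acts - acts.mean(dim=0, keepdim=True)).pow(2).sum()
    else:
        total = acts.pow(2).sum()
    return (residual / total).item()

# Variance explained is then 1 - fvu(acts, recons).
```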
In general I think SAEs with low k should be at least as good as k-means clustering, and if they’re not I’m a little bit suspicious (when I first tried this on GPT-2, a TopK SAE trained with k = 4 did about as well as k-means clustering with the nonlinear argmax encoder).
Here’s my clustering code: https://github.com/JoshEngels/CheckClustering/blob/main/clustering.py
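To spell out the two encoders I’m comparing, here’s a rough sketch (names and shapes are hypothetical, not taken from the linked code):

```python
import torch

def argmax_encode(acts, W_enc, b_enc, W_dec):
    """k = 1 SAE-style encoder: keep only the feature with the largest pre-activation."""
    pre = acts @ W_enc + b_enc            # (n_tokens, n_features)
    idx = pre.argmax(dim=-1)              # linear argmax over feature pre-activations
    mags = pre.gather(1, idx[:, None])    # winning activation value per token
    return mags * W_dec[idx]              # reconstruct with the single chosen decoder row

def nearest_centroid_encode(acts, centroids):
    """k-means-style encoder: assign each activation to its nearest centroid."""
    idx = torch.cdist(acts, centroids).argmin(dim=-1)
    return centroids[idx]                 # reconstruct with the centroid itself
```

The second version is what I mean by the nonlinear argmax encoder: each point is reconstructed directly from its nearest centroid rather than from a scaled decoder direction.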
I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I find that clustering recovers ~87% of the variance, while a k = 32 SAE explains more like 95%. I used the nonlinear version (assigning each point to its nearest centroid) to give k-means the biggest advantage possible, and ran the clustering with the FAISS library.
Definitely take this with a grain of salt; I’m going to look through my code and see if I can reproduce your results on Pythia too, and if so, try a larger model as well. Code: https://github.com/JoshEngels/CheckClustering/tree/main
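The core of the setup is roughly the following (a simplified sketch with my own function name, not the exact code in the repo):

```python
import faiss
import numpy as np

def kmeans_variance_explained(acts: np.ndarray, n_centroids: int) -> float:
    """Cluster activations of shape (n_tokens, d_model) with FAISS k-means and report
    the fraction of variance explained by nearest-centroid reconstruction."""
    x = acts.astype(np.float32)
    kmeans = faiss.Kmeans(x.shape[1], n_centroids, niter=20, verbose=False)
    kmeans.train(x)

    # Nonlinear encoder: assign each activation to its nearest centroid.
    _, idx = kmeans.index.search(x, 1)
    recons = kmeans.centroids[idx[:, 0]]

    residual = ((x - recons) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()   # mean-subtracted denominator, as above
    return 1.0 - residual / total
```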
What do you mean by encoding/decoding like normal but using the k-means vectors? Shouldn’t the SAE training process for a TopK SAE with k = 1 find these vectors then?
In general I’m a bit skeptical that clustering will work as well on larger models. My impression is that most small models have pretty token-level features, which might be clusterable with k = 1, but in larger models many activations may belong to multiple “clusters”, which is what you need dictionary learning for.
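As a toy illustration of the difference (all names hypothetical): nearest-centroid clustering reconstructs each activation from a single centroid, while a TopK SAE can combine several dictionary elements per activation:

```python
import torch

def topk_sae_reconstruct(acts, W_enc, b_enc, W_dec, k: int = 32):
    """TopK SAE forward pass: each activation is rebuilt from k dictionary elements,
    so a single point can draw on several 'clusters' at once."""
    pre = acts @ W_enc + b_enc                              # (n_tokens, n_features)
    vals, idx = pre.topk(k, dim=-1)                         # keep the k largest feature activations
    sparse = torch.zeros_like(pre).scatter_(-1, idx, vals)  # sparse code with k nonzeros per token
    return sparse @ W_dec                                   # sum of k decoder rows per token
```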
Thanks! I think you’re right that this isn’t conclusive evidence; trying a different setting where self-awareness isn’t so similar to the in-distribution behavior might help, which is what I’m planning to look at next. I do think you shouldn’t draw much from the absolute magnitude of the effect at a given layer for a given dataset, since the questions in each dataset are totally different (the relative magnitude across layers is a more principled thing to compare, which is why we argued they looked similar).