Good points, thanks! We do try training with the same setup as the paper in the “Investigating Further” section, which still doesn’t work. I agree that riskiness might just be a bad setup here; I’m planning to try some other more interesting/complex awareness behaviors next and will try backdoors with those too.
Josh Engels’s Shortform
I really liked @Sam Marks’ recent post on downstream applications as validation for interp techniques, and I’ve been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I’ve written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these phenomena better, that’s great! Otherwise, I hope that seeing where our current techniques fail might help us develop better ones.
I’m also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren’t as useful as linear probing, or even careful black box experiments, that seems important to know!
Here’s the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0
Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.
Takeaways From Our Recent Work on SAE Probing
I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly.
https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309
I think the correct way to compute FVU is to subtract the per-dimension mean when computing the denominator. See the SAEBench implementation here:
https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
When I used your FVU implementation, I got 72% variance explained; this is still less than your number, but much closer, so I think this difference might account for the apparent improvement over the SAEBench numbers.
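For concreteness, here’s a minimal sketch of the two denominator conventions (function and tensor names are mine, not from either codebase):

```python
import torch

def fvu(acts: torch.Tensor, recons: torch.Tensor, subtract_mean: bool = True) -> float:
    """Fraction of variance unexplained for activations of shape (n_tokens, d_model).

    With subtract_mean=True, the denominator is the variance around the per-dimension
    mean (the SAEBench convention); with False, it is the raw second moment, which
    inflates the denominator and makes FVU look smaller.
    """
    residual = (acts - recons).pow(2).sum()
    if subtract_mean:
        total = (acts - acts.mean(dim=0, keepdim=True)).pow(2).sum()
    else:
        total = acts.pow(2).sum()
    return (residual / total).item()

# Variance explained is then 1 - fvu(acts, recons).
```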
In general I think SAEs with low k should be at least as good as k-means clustering, and if they’re not I’m a little bit suspicious (when I first tried this on GPT-2, a TopK SAE trained with k = 4 did about as well as k-means clustering with the nonlinear argmax encoder).
Here’s my clustering code: https://github.com/JoshEngels/CheckClustering/blob/main/clustering.py
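To spell out the two encoders I’m comparing, here’s a rough sketch (names and shapes are hypothetical, not taken from the linked code):

```python
import torch

def argmax_encode(acts, W_enc, b_enc, W_dec):
    """k = 1 SAE-style encoder: keep only the feature with the largest pre-activation."""
    pre = acts @ W_enc + b_enc            # (n_tokens, n_features)
    idx = pre.argmax(dim=-1)              # linear argmax over feature pre-activations
    mags = pre.gather(1, idx[:, None])    # winning activation value per token
    return mags * W_dec[idx]              # reconstruct with the single chosen decoder row

def nearest_centroid_encode(acts, centroids):
    """k-means-style encoder: assign each activation to its nearest centroid."""
    idx = torch.cdist(acts, centroids).argmin(dim=-1)
    return centroids[idx]                 # reconstruct with the centroid itself
```

The second version is what I mean by the nonlinear argmax encoder: each point is reconstructed directly from its nearest centroid rather than from a scaled decoder direction.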
I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I find that clustering recovers ~87% of the variance, while a k = 32 SAE explains more like 95%. I used the nonlinear version (assigning each point to its nearest centroid) to give k-means the biggest advantage possible, and ran the clustering with the FAISS library.
Definitely take this with a grain of salt; I’m going to look through my code and see if I can reproduce your results on Pythia too, and if so, try a larger model as well. Code: https://github.com/JoshEngels/CheckClustering/tree/main
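The core of the setup is roughly the following (a simplified sketch with my own function name, not the exact code in the repo):

```python
import faiss
import numpy as np

def kmeans_variance_explained(acts: np.ndarray, n_centroids: int) -> float:
    """Cluster activations of shape (n_tokens, d_model) with FAISS k-means and report
    the fraction of variance explained by nearest-centroid reconstruction."""
    x = acts.astype(np.float32)
    kmeans = faiss.Kmeans(x.shape[1], n_centroids, niter=20, verbose=False)
    kmeans.train(x)

    # Nonlinear encoder: assign each activation to its nearest centroid.
    _, idx = kmeans.index.search(x, 1)
    recons = kmeans.centroids[idx[:, 0]]

    residual = ((x - recons) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()   # mean-subtracted denominator, as above
    return 1.0 - residual / total
```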
What do you mean by encoding/decoding like normal but using the k-means vectors? Shouldn’t the SAE training process for a TopK SAE with k = 1 find these vectors then?
In general I’m a bit skeptical that clustering will work as well on larger models. My impression is that most small models have pretty token-level features, which might be clusterable with k = 1, but in larger models many activations may belong to multiple “clusters”, which is what you need dictionary learning for.
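As a toy illustration of the difference (all names hypothetical): nearest-centroid clustering reconstructs each activation from a single centroid, while a TopK SAE can combine several dictionary elements per activation:

```python
import torch

def topk_sae_reconstruct(acts, W_enc, b_enc, W_dec, k: int = 32):
    """TopK SAE forward pass: each activation is rebuilt from k dictionary elements,
    so a single point can draw on several 'clusters' at once."""
    pre = acts @ W_enc + b_enc                              # (n_tokens, n_features)
    vals, idx = pre.topk(k, dim=-1)                         # keep the k largest feature activations
    sparse = torch.zeros_like(pre).scatter_(-1, idx, vals)  # sparse code with k nonzeros per token
    return sparse @ W_dec                                   # sum of k decoder rows per token
```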
Thanks! I think you’re right that this isn’t conclusive evidence; trying a different setting where self-awareness isn’t so similar to the in-distribution behavior might help, which is what I’m planning to look at next. I do think you shouldn’t draw much from the absolute magnitude of the effect at a given layer for a given dataset, since the questions in each dataset are totally different (the relative magnitude across layers is a more principled thing to compare, which is why we argued they looked similar).