Hi, I’m undertaking a research project and I think that an end2end SAE with automated explanations would be a lot of help.
The project is a parameter-efficient fine-tuning method that may be very interpretable, allowing researchers to know what the model learned during fine-tuning:
Start by acquiring a model with end-to-end SAEs throughout. Insert a 1-hidden-layer FFNN (with a skip connection) after an SAE latent vector and pass the output to the rest of the model. Since SAE latents are interpretable, the rows of the first FFNN matrix will be interpretable as questions about the latents, and the columns of the second FFNN matrix will be interpretable as question-conditional edits to the residual latent vector, as in https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
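To make this concrete, here is a minimal sketch of the adapter I have in mind (PyTorch; the names are my own, and it assumes access to an SAE whose encode/decode steps can be hooked):

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """One-hidden-layer FFNN with a skip connection, acting on SAE latents.

    Hypothetical sketch: d_sae is the SAE dictionary size, d_hidden the adapter
    width. Only the adapter is trained; the SAE and base model stay frozen.
    """
    def __init__(self, d_sae: int, d_hidden: int):
        super().__init__()
        self.W_in = nn.Linear(d_sae, d_hidden)   # rows read the latents ("questions")
        self.W_out = nn.Linear(d_hidden, d_sae)  # columns write "question-conditional edits"

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z2 = z + FFNN(z): the skip connection keeps the original latents intact
        return z + self.W_out(torch.relu(self.W_in(z)))

# Intended placement (pseudocode, assuming some SAE wrapper with encode/decode):
#   z = sae.encode(resid)        # SAE latent vector at the hooked layer
#   z2 = adapter(z)              # trainable edit in the latent basis
#   resid = sae.decode(z2)       # passed on to the rest of the model
```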
I would expect end-to-end SAEs to work better than local SAEs because, as you found, local SAEs do not produce decodings that preserve the model's behavior as well as end-to-end SAEs do.
If you could share your dict[SAE latent, description] for e2e-saes-gpt, I would appreciate it so much. If you cannot, I’ll instead use a local SAE for which I can find descriptions of the latents, though I expect it would not work as well.
Also, you might like to hear that some of your links are dead:
https://www.neuronpedia.org/gpt2sm-apollojt results in: “Error: Minified React error #185; visit https://react.dev/errors/185 for the full message or use the non-minified dev environment for full errors and additional helpful warnings.”
https://huggingface.%20co/apollo-research/e2e-saes-gpt2 cannot be reached.
Apologies for the issue with the Neuronpedia link. It’s now been resolved.
Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn’t be of much help for your project unfortunately.
Thanks a lot for letting us know about the dead links. Though note you have a “%20” in the second one which shouldn’t be there. It works fine without it.
Thank you, Dan.
I suppose I really only need latents in one of the 60 SAEs rather than all 60, reducing the number to 768. It is always tricky to use someone else’s code, but I can use run_autointerp from your scripts/analysis/autointerp.py to label what I need. Could you give me an idea of how much compute that would take?
I was hoping to get your feedback on my project idea.
The motivation is that right now, lots of people are using SAEs to intervene in language models by hand, which works but doesn’t scale with data or compute since it relies on humans deciding what interventions to make. It would be great to have trainable SAE interventions, that is, components that edit SAE latents and are trained in place of LoRA matrices.
The benefit over LoRA would be that if the added component is simple, such as z2 = z + FFNN(z), where the FFNN has only one hidden layer, then it would be possible to interpret the FFNN and explain what the model learned during fine-tuning.
I’ve included a diagram below. The Xs represent connections that are disconnected.
heh, unfortunately a single SAE is 768 * 60. The residual stream in GPT2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.
I can’t recall the compute costs for that script, sorry. A couple of things to note:
For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
You will only need to produce explanations for activations, and won’t have to do the second step of asking the model to produce activations given the explanations.
It’s a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
Thank you again.
I’ll look for a smaller model with SAEs that have smaller hidden dimensions and more thoroughly labeled latents, even though they won’t be end-to-end. If I don’t find anything that fits my purposes, I might try using your code to train my own end-to-end SAEs of more convenient dimension. I may want to do this anyway, since I expect the technique I described would work best for turning a helpful-only model into a helpful-harmless model, and I don’t see such a helpful-only model on Neuronpedia.
If the FFNN has a hidden dimension of 16, then it would have around 1.5 million parameters, which doesn’t sound too bad, and 16 might be enough to find something interesting.
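For what it’s worth, here is the rough arithmetic behind that estimate (assuming GPT-2 small with the 60x dictionary discussed above):

```python
# Hypothetical parameter count for the adapter in the SAE basis of GPT-2 small
d_sae = 768 * 60                       # 46,080 latents in a single SAE
d_hidden = 16

params = d_sae * d_hidden + d_hidden   # W_in + b_in
params += d_hidden * d_sae + d_sae     # W_out + b_out
print(f"{params:,}")                   # 1,520,656, i.e. roughly 1.5M

# For comparison, a rank-r LoRA on the 768-dim residual stream has 2 * 768 * r
# parameters, e.g. about 25k at r = 16.
```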
Low-rank factorization might help with the parameter counts.
Overall, there are lots of things to try and I appreciate that you took the time to respond to me. Keep up the great work!
Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?
That is a sensible way to save compute resources. Thank you.
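Concretely, I imagine ranking latents by how strongly the trained adapter reads from and writes to them, and only sending the top few hundred through autointerp. A rough sketch, reusing the hypothetical LatentAdapter from my earlier message:

```python
import torch

# Hypothetical post-hoc selection of latents worth interpreting.
# LatentAdapter is the class from the earlier sketch; here it would be the
# trained adapter rather than a fresh one.
adapter = LatentAdapter(d_sae=768 * 60, d_hidden=16)

W_in = adapter.W_in.weight        # (d_hidden, d_sae): how each latent is read
W_out = adapter.W_out.weight      # (d_sae, d_hidden): how each latent is written to

read_score = W_in.abs().sum(dim=0)     # per-latent input relevance
write_score = W_out.abs().sum(dim=1)   # per-latent output relevance
top_latents = torch.topk(read_score + write_score, k=200).indices
```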
Re. making this more efficient, I can think of a few options.
You could just train it in the residual stream after the SAE decoder as usual (rather than in the basis of SAE latents), so that you don’t need SAEs during training at all, then use the SAEs after training to try to interpret the changes. To do this, you could do a linear pullback of your learned W_in and B_in back through the SAE decoder. That is, interpret (SAE_decoder)@(W_in), etc. Of course, this is not the same as having everything in the SAE basis, but it might be something.
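Something like this, roughly (a sketch only; the shape conventions here are one possible choice, with decoder columns taken as feature directions):

```python
import torch

# Rough sketch of the pullback: the adapter is trained in the residual stream
# (d_model) and only interpreted in the SAE basis afterwards.
d_model, d_sae, d_hidden = 768, 768 * 60, 16

W_dec = torch.randn(d_model, d_sae)     # frozen SAE decoder (placeholder values)
W_in = torch.randn(d_hidden, d_model)   # learned adapter input weights
W_out = torch.randn(d_model, d_hidden)  # learned adapter output weights

# How strongly each SAE latent drives each adapter hidden unit:
read_in_sae_basis = W_in @ W_dec        # (d_hidden, d_sae)

# How much each hidden unit writes along each feature direction
# (most meaningful if decoder columns are roughly unit norm):
write_in_sae_basis = W_dec.T @ W_out    # (d_sae, d_hidden)
```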
Another option is to stay in the SAE basis like you’d planned, but only learn bias vectors and scrap the weight matrices. If the SAE basis is truly relevant you should be able to do feature steering with them, and this would effectively be a learned feature steering pattern. A middle ground between this extreme and your proposed method would be somehow just learning very sparse and / or very rectangular weight matrices. Preferably both.
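For concreteness, the bias-only extreme would be tiny (a sketch; names are placeholders):

```python
import torch
import torch.nn as nn

class BiasOnlySteering(nn.Module):
    """Bias-only variant: a single learned steering vector in the SAE latent
    basis, with no weight matrices (a learned feature-steering pattern)."""
    def __init__(self, d_sae: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(d_sae))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.bias  # the same input-independent edit to every latent vector
```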
Actually, it might potentially work OK as you’ve got it, since conceivably you could get away with lower-rank adaptors (more rectangular weight matrices) in the SAE basis than you could in the residual stream, because you get more expressive power from the high-dimensional space. But my gut says you won’t actually be able to get away with a much lower-rank thing than usual, and that the thing you really want to exploit in the SAE basis is something like sparsity (as a full-rank bias vector does), not low rank.
Thank you for your brainpower.
There’s a lot to try, and I hope to get to this project once I have more time.