Super interesting work!
I like the idea of checking whether any features from the base model are dead in the instruction-tuned-and-fine-tuned model, as a proxy for “are there any features which fine-tuning causes the model to become unable to recognize”. A related question also strikes me as interesting: does an SAE trained on the instruction-tuned model have any features which are dead in the base model? Those might represent new features the instruction-tuned model learned, which in turn might give some insight into how much of instruction tuning is learning to recognize new features of the context, versus how much is simply changing the behaviors the model exhibits in response to features of the context that the base model already recognized as important.
I don’t see any SAEs trained on Gemma 2b instruct, but I do see one on Gemma 9b instruct (residual), and one on Gemma 9b base (also residual), so running this experiment could hopefully be a matter of substituting those models into the code you wrote and tweaking until it runs. A rough sketch of what I mean is below.
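In case it helps get that substitution started, here’s a minimal sketch of one direction of the comparison (instruct-SAE features evaluated on the base model), using TransformerLens and SAE Lens. The release name, SAE id, and model name are my guesses at the Gemma Scope identifiers and should be checked against the SAE Lens pretrained-SAE directory or Neuronpedia; the return signature of `SAE.from_pretrained` has also varied across SAE Lens versions.

```python
# A minimal sketch of the cross-model dead-feature check, assuming the
# Gemma Scope SAEs are loadable through SAE Lens. The release and sae_id
# strings are guesses -- verify them before running.
import torch
from sae_lens import SAE
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# SAE trained on the *instruct* model's residual stream.
sae, _, _ = SAE.from_pretrained(
    release="gemma-scope-9b-it-res",            # assumed release name
    sae_id="layer_20/width_16k/average_l0_71",  # assumed SAE id
    device=device,
)

# Run the *base* model and record which instruct-SAE features ever fire.
model = HookedTransformer.from_pretrained("gemma-2-9b", device=device)
hook_name = sae.cfg.hook_name  # e.g. "blocks.20.hook_resid_post"

prompts = ["The quick brown fox", "..."]  # stand-in for a real eval corpus
ever_fired = torch.zeros(sae.cfg.d_sae, dtype=torch.bool, device=device)
for prompt in prompts:
    _, cache = model.run_with_cache(prompt, names_filter=hook_name)
    acts = cache[hook_name]           # [batch, seq, d_model]
    feats = sae.encode(acts)          # [batch, seq, d_sae]
    ever_fired |= (feats > 0).reshape(-1, feats.shape[-1]).any(dim=0)

print(f"{int((~ever_fired).sum())} of {sae.cfg.d_sae} instruct-SAE "
      f"features never fired on the base model")
```

Swapping in the base-model SAE and running the instruct model should give the other direction of the comparison.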
Speaking of the code you wrote: I think the repo is private, since it 404s when I try to look at it.