As an ex-Googler, my informed guess would be that Gemma 2B will have been distilled (or perhaps both pruned and distilled) from a larger teacher model, presumably some larger Gemini model — and that Gemma-2b and Gemma-2b-Python-codes may well have been distilled separately, as separate students of two similar-but-not-identical teacher models, using different teaching datasets. The fact that the global mean cosine you find here isn’t ~0 shows that, if so, the separate distillation processes were either warm-started from similar models (presumably a weak 2B model — a sensible way to save some distillation expense), or at least shared the same initial token embeddings/unembeddings.
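(To make the “isn’t ~0” step concrete: independently initialized weight vectors in d dimensions have cosines concentrated around 0 with spread of order 1/√d, while two models nudged away from a shared starting point keep a clearly nonzero mean cosine. A minimal NumPy sketch; the dimension, row count, and noise scale below are purely illustrative, not Gemma’s actual numbers:)

```python
import numpy as np

rng = np.random.default_rng(0)
d, rows = 2048, 1000  # illustrative sizes only, not Gemma's actual shapes

def mean_cosine(A, B):
    """Mean cosine similarity between corresponding rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

# Two independently initialized "models": mean cosine ~ 0
# (each row's cosine has spread of order 1/sqrt(d))
print(mean_cosine(rng.standard_normal((rows, d)),
                  rng.standard_normal((rows, d))))   # ~0.00

# Shared starting point plus independent further training (a crude stand-in
# for two distillation runs warm-started from the same weak model):
# the mean cosine stays well away from 0
W0 = rng.standard_normal((rows, d))
Wa = W0 + 0.3 * rng.standard_normal((rows, d))
Wb = W0 + 0.3 * rng.standard_normal((rows, d))
print(mean_cosine(Wa, Wb))                           # ~0.92
```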
Regardless of how they were created, these two Gemma models clearly differ pretty significantly, so I’m unsurprised by your subsequent discovery that the SAE basically doesn’t transfer between them.
For Mistral 7B, I would be astonished if any distillation was involved; I would expect just a standard combination of fine-tuning followed by either RLHF or something along the lines of DPO. In a very high-dimensional space, a cosine of 0.96 means “almost identical”, so clearly the instruct training here consists of fairly small, targeted changes, and I’m unsurprised that as a result the SAE transfers quite well.
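(For calibration on the 0.96 figure: an unrelated direction in a space of that dimensionality would give a cosine of roughly 0, so 0.96 means the instruct weights still point almost entirely along the base-model direction. Another small NumPy sketch, with the dimensionality chosen only for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # illustrative dimensionality only

w_base = rng.standard_normal(d)
w_base /= np.linalg.norm(w_base)

# Baseline: an unrelated direction has cosine of order 1/sqrt(d) with w_base
w_other = rng.standard_normal(d)
print(abs(w_other @ w_base) / np.linalg.norm(w_other))    # ~0.01

# A cosine of 0.96 corresponds to adding an orthogonal perturbation of
# relative size eps = sqrt(1/0.96**2 - 1) ~ 0.29, i.e. the "instruct" vector
# is still dominated by the base direction
eps = np.sqrt(1 / 0.96**2 - 1)
delta = rng.standard_normal(d)
delta -= (delta @ w_base) * w_base        # strip the component along w_base
delta *= eps / np.linalg.norm(delta)
w_instruct = w_base + delta
print(w_base @ w_instruct / np.linalg.norm(w_instruct))   # ~0.96
```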
Thanks for the insight! I’d expect the same to hold, though, for the Gemma 2B base (pre-trained) vs. Gemma 2B Instruct models? Gemma-2b-Python-codes is just a full fine-tune on top of the Instruct model (probably produced without a large number of update steps), and previous work that studied Instruct models indicated that SAEs don’t transfer to the Instruct Gemma 2B either.