| Model/Finetune | Global Mean (Cosine) Similarity |
|---|---|
| Gemma-2b/Gemma-2b-Python-codes | 0.6691 |
| Mistral-7b/Mistral-7b-MetaMath | 0.9648 |
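For context, here is a minimal sketch of one way such a number could be computed. It assumes the metric is the cosine between the two models' residual-stream activations at a chosen layer, averaged over token positions on some evaluation text; the post's actual procedure may differ, and the function name, layer index, and model IDs below are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_activation_cosine(base_name: str, tuned_name: str, text: str, layer: int = 6) -> float:
    """Mean cosine similarity between the two models' hidden states at `layer`.

    Sketch only: assumes both models share a tokenizer and architecture,
    and that 'global mean cosine' refers to activation similarity.
    """
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)

    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h_base = base(**ids, output_hidden_states=True).hidden_states[layer]
        h_tuned = tuned(**ids, output_hidden_states=True).hidden_states[layer]

    # Cosine per token position, then averaged over the sequence.
    cos = torch.nn.functional.cosine_similarity(h_base, h_tuned, dim=-1)
    return cos.mean().item()

# Illustrative usage (model IDs are examples, not taken from the post):
# mean_activation_cosine("mistralai/Mistral-7B-v0.1", "meta-math/MetaMath-Mistral-7B",
#                        "Solve for x: 2x + 3 = 11.")
```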
As an ex-Googler, my informed guess is that Gemma 2B was distilled (or perhaps both pruned and distilled) from a larger teacher model, presumably some larger Gemini model, and that Gemma-2b and Gemma-2b-Python-codes may well have been distilled separately, as students of two similar-but-not-identical teacher models, using different teaching datasets. The fact that the global mean cosine you find here isn't ~0 suggests that, if so, the separate distillation processes were either warm-started from similar models (presumably a weak 2B model, a sensible way to save some distillation expense), or at least shared the same initial token embeddings/unembeddings.
Regardless of how they were created, these two Gemma models clearly differ pretty significantly, so I’m unsurprised by your subsequent discovery that the SAE basically doesn’t transfer between them.
For Mistral 7B, I would be astonished if any distillation was involved; I would expect just the standard combination of fine-tuning followed by either RLHF or something along the lines of DPO. In very high dimensional space, a cosine of 0.96 means "almost identical", so clearly the instruct training here consists of fairly small, targeted changes, and I'm unsurprised that as a result the SAE transfers quite well.
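To give a sense of scale (a toy illustration, not numbers from the post): two unrelated random vectors in a few thousand dimensions have cosine very close to 0, while a vector plus a modest random perturbation still has cosine around 0.96 with the original, so 0.96 really does mean "pointing almost exactly the same way".

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # roughly the residual-stream width of a 7B model

# Two unrelated random directions are nearly orthogonal: cosine ~ 0 +/- 1/sqrt(d).
a, b = rng.standard_normal(d), rng.standard_normal(d)
cos_random = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine of two random {d}-dim vectors: {cos_random:+.4f}")

# A relatively small perturbation barely changes the direction: cosine ~ 0.96.
b2 = a + 0.3 * rng.standard_normal(d)
cos_perturbed = a @ b2 / (np.linalg.norm(a) * np.linalg.norm(b2))
print(f"cosine after a small perturbation: {cos_perturbed:.4f}")
```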
However, if we found that the classifier was using a feature for “How smart is the human asking the question?” to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.