That’s very cool, I’m looking forward to seeing those results! The Top-K extension is particularly interesting, as that was something I wasn’t sure how to approach.
I imagine you’ve explored important directions I haven’t touched, such as better benchmarking, a top-k implementation, and testing on larger models. Having multiple independent validations of an approach also seems valuable.
I’d be interested in continuing this line of research, especially circuits with Matryoshka SAEs. I’d love to hear about what directions you’re thinking of. Would you want to have a call sometime about collaboration or coordination? (I’ll DM you!)
Really looking forward to reading your post!
Even with all possible prefixes included in every batch, the toy model learns the same small mixing between parent and children. (This was the best of 2 runs; in the first run the Matryoshka SAE didn’t represent one of the features.) https://sparselatents.com/matryoshka_toy_all_prefixes.png
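For concreteness, here is a minimal NumPy sketch of what "all possible prefixes in every batch" could mean for the loss: every prefix of the dictionary is asked to reconstruct the input on its own, and the per-prefix errors are summed. The function name `all_prefix_loss`, the argument shapes, and the omission of any sparsity penalty are assumptions made for illustration; this is not the code that produced the plot above.

```python
import numpy as np

def all_prefix_loss(x, latents, W_dec, b_dec):
    """Reconstruction error summed over every prefix of the dictionary.

    Hypothetical helper for illustration only.
    x:       (batch, d_model) inputs
    latents: (batch, n_latents) SAE activations
    W_dec:   (n_latents, d_model) decoder rows
    b_dec:   (d_model,) decoder bias
    """
    # contributions[b, i, :] is what latent i adds to the reconstruction of example b.
    contributions = latents[:, :, None] * W_dec[None, :, :]
    # partial[b, k - 1, :] is the reconstruction that uses only the first k latents.
    partial = np.cumsum(contributions, axis=1) + b_dec
    # Squared error of each prefix reconstruction, shape (batch, n_latents).
    errors = ((partial - x[:, None, :]) ** 2).sum(axis=-1)
    # Average over the batch, then sum over all prefix lengths.
    return errors.mean(axis=0).sum()
```

Materializing every prefix costs memory proportional to `batch * n_latents * d_model`, which is fine for a toy model but is presumably why larger runs use only a few prefix lengths per batch.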
Here’s a hypothesis that could explain most of this mixing. If the hypothesis is true, then even if every possible prefix is included in every batch, there will still be mixing.
Hypothesis:
This could explain these weird properties of the heatmap:
- Parent decoder vector has small positive cosine similarity with child features
- Child decoder vectors have small negative cosine similarity with other child features
Still unexplained by this hypothesis:
- Child decoder vectors have very small negative cosine similarity with the parent feature
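The properties listed above are statements about a cosine-similarity matrix, so a short sketch of how such a matrix can be computed may help when checking them. It assumes the heatmap compares learned decoder rows against ground-truth feature directions; the name `decoder_cosine_similarity` and the toy numbers are made up for illustration and are not results from the runs above.

```python
import numpy as np

def decoder_cosine_similarity(learned, true_features):
    """Cosine similarity between each learned decoder row and each true feature.

    Hypothetical helper; assumes every row has nonzero norm.
    learned:       (n_latents, d_model)
    true_features: (n_features, d_model)
    Returns an (n_latents, n_features) matrix.
    """
    a = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    b = true_features / np.linalg.norm(true_features, axis=1, keepdims=True)
    return a @ b.T

# Illustrative values only: one parent direction and two child directions.
true_features = np.eye(3)  # rows: parent, child 1, child 2
learned = np.array([
    [1.0, 0.1, 0.1],   # parent latent leaks slightly into both children
    [0.0, 1.0, -0.1],  # child 1 latent points slightly away from child 2
    [0.0, -0.1, 1.0],  # child 2 latent points slightly away from child 1
])
sims = decoder_cosine_similarity(learned, true_features)
```

In this toy matrix, `sims[0, 1:]` holds the parent-to-child entries and `sims[1, 2]` and `sims[2, 1]` hold the child-to-other-child entries, which are the off-diagonal cells the bullets refer to.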