It sounds like we are not that far apart here. We’ve been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability “stories” to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.