It’s also hard to overstate the importance of tooling that is:
Streamlined: handles most relevant concerns by default, in a reasonable way, so that new users won’t trip on them (e.g. evals tooling should offer simple and reasonably effective elicitation strategies off-the-shelf)
Well-documented: both at the API level and through succinct end-to-end examples of doing important things
I suspect TransformerLens and its associated Colab walkthroughs have had a huge impact in popularising mechanistic interpretability.
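To make the "streamlined" point concrete, here is a minimal sketch of what off-the-shelf elicitation defaults could look like. Everything in it is hypothetical and not taken from any existing evals library: the names ElicitationConfig, build_prompt and elicit, the default instruction text, and the choice of few-shot prompting plus best-of-n sampling as the default strategies are all assumptions made for illustration. The model and scorer are plain callables supplied by the user.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ElicitationConfig:
    """Hypothetical defaults a new user should not have to think about."""
    instruction: str = "Answer as accurately as you can."
    few_shot_examples: Sequence[tuple[str, str]] = ()  # (question, answer) pairs
    num_samples: int = 5  # best-of-n: sample several completions, keep the best


def build_prompt(question: str, config: ElicitationConfig) -> str:
    """Assemble the instruction, any few-shot examples, and the question."""
    parts = [config.instruction]
    for q, a in config.few_shot_examples:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def elicit(
    question: str,
    model: Callable[[str], str],     # assumed interface: prompt in, completion out
    scorer: Callable[[str], float],  # assumed interface: completion in, score out
    config: Optional[ElicitationConfig] = None,
) -> str:
    """Sample num_samples completions and return the highest-scoring one."""
    config = config or ElicitationConfig()
    prompt = build_prompt(question, config)
    completions = [model(prompt) for _ in range(config.num_samples)]
    return max(completions, key=scorer)
```

The design choice being illustrated is that calling elicit(question, model, scorer) with no config still does something reasonable, while every default remains overridable for users who know what they want.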
As someone with very little working knowledge of evals, I think the following open-source resources would be useful for pedagogy:
A brief overview of the field covering central concepts, goals, challenges
A list of starter projects for building skills / intuition
A list of more advanced projects that address timely / relevant research needs
Maybe similar in style to https://www.neelnanda.io/mechanistic-interpretability/quickstart