Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

I recently published a rather large side project of mine that attempts to replicate, with the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4], and Google DeepMind[5].

The project provides a complete end-to-end pipeline for training Sparse Autoencoders to interpret LLM features, from activation capture through training, interpretation, and verification. All code, data, trained models, and detailed documentation are publicly available, in an attempt to make this as close to open research as possible, though in my opinion calling it an extensively documented personal project wouldn’t be wrong either.
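
To give a concrete sense of what the first two pipeline stages involve, here is a minimal PyTorch sketch of activation capture and a single SAE training step. This is not the project’s actual code: the hooked layer index, SAE width, sparsity coefficient, and the one-sentence stand-in corpus are all illustrative assumptions.

```python
# Minimal sketch (not the project's code): capture residual-stream activations
# from one Llama 3.2-3B decoder layer and run one sparse-autoencoder training
# step on them. Layer index, SAE width, and hyperparameters are assumptions.
# Note: meta-llama/Llama-3.2-3B is a gated repo; access must be granted first.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B"
LAYER = 12                 # assumed: one of the middle decoder layers
D_MODEL = 3072             # hidden size of Llama 3.2-3B
D_SAE = 8 * D_MODEL        # assumed expansion factor of 8

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
model.eval()

captured = []

def capture_hook(module, inputs, output):
    # Decoder layers return the hidden states (possibly inside a tuple);
    # flatten (batch, seq, d_model) -> (tokens, d_model) and store on CPU.
    hidden = output[0] if isinstance(output, tuple) else output
    captured.append(hidden.detach().float().cpu().reshape(-1, D_MODEL))

handle = model.model.layers[LAYER].register_forward_hook(capture_hook)

texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in corpus
with torch.no_grad():
    for text in texts:
        batch = tokenizer(text, return_tensors="pt").to(device)
        model(**batch)
handle.remove()
activations = torch.cat(captured)  # (num_tokens, d_model)

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: ReLU encoder + linear decoder."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        latents = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(latents), latents

sae = SparseAutoencoder(D_MODEL, D_SAE)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # assumed sparsity penalty

# One illustrative training step: reconstruction loss + L1 sparsity penalty.
optimizer.zero_grad()
reconstruction, latents = sae(activations)
loss = nn.functional.mse_loss(reconstruction, activations) \
       + l1_coeff * latents.abs().mean()
loss.backward()
optimizer.step()
```

In the real pipeline the activations come from a much larger corpus streamed through the model and cached to disk, and training loops over many batches; the sketch only shows the shape of the computation, with interpretation and verification of the learned features happening afterwards.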

Since LessWrong has a strong focus on AI interpretability research, I thought some of you might find value in this open research replication. I’m happy to answer any questions about the methodology, results, or future directions.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^