Cool project! Thanks for doing it and sharing; great to see more models with SAEs.
> interpretability research on proprietary LLMs that was quite popular this year and great research papers by Anthropic[1][2], OpenAI[3][4] and Google Deepmind
I run the Google DeepMind team, and just wanted to clarify that our work was not on proprietary closed-weight models but on Gemma 2, as were our open-weight SAEs. Gemma 2 is about as open as Llama, imo. We try to use open models wherever possible, for the general reasons of good scientific practice, ease of replicability, etc. We couldn’t open source the data, though, and didn’t go to the effort of open sourcing the code, so I don’t think our releases can be considered truly open source. OpenAI did most of their work on GPT-2 and only ran their large-scale experiment on GPT-4, I believe. All the Anthropic work I’m aware of is on proprietary models, alas.
Hi Neel,
you’re absolutely right: all the research in the Gemma Scope paper was performed on the open-weight Gemma 2 model. I wanted to group all the research my project builds on into one concise sentence, and in doing so I erroneously put your work in the ‘proprietary LLMs’ section. I went ahead and corrected the mistake.
My apologies.
I hope you still enjoyed the project, and thank you for your great research work at DeepMind. =)