Michaël Trazzi comments on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Michaël Trazzi 6 Oct 2023 17:26 UTC
LW: 3 AF: 2
"Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated."
What frontier model are we talking about here? How would we know if success had been demonstrated? What’s the timeline for testing if this scales?
- Zac Hatfield-Dodds 6 Oct 2023 19:59 UTC
  LW: 6 AF: 3
  The obvious targets are of course Anthropic’s own frontier models, Claude Instant and Claude 2.
  
  The paper’s section “Problem setup: what makes a good decomposition?” discusses what success might look like and what it might enable. Note, though, that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with a perfect decomposition we’d have plenty left to do: unraveling circuits and building a larger-scale understanding of models.