Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.
What frontier model are we talking about here? How would we know if success had been demonstrated? What’s the timeline for testing if this scales?
The obvious targets are of course Anthropic’s own frontier models, Claude Instant and Claude 2.
The section "Problem setup: what makes a good decomposition?" discusses what success might look like and enable. But note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with a perfect decomposition we’d have plenty left to do: unraveling circuits and building a larger-scale understanding of models.