From Understanding and steering Llama 3:

A further interesting direction for automated interpretability would be to build interpreter agents: AI scientists which, given an SAE feature, could create hypotheses about what the feature might do, come up with experiments that would distinguish between those hypotheses (for instance, new inputs or feature ablations), and then repeat until the feature is well understood. This kind of agent might be the first automated alignment researcher. Our early experiments in this direction have shown that we can substantially increase automated interpretability performance with an iterative refinement step, and we expect to be able to push this approach much further.
Seems like it would be great if inference scaling laws worked for this particular application.
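For concreteness, here is a minimal sketch of what such an interpreter-agent loop could look like. Everything in it is an assumption for illustration: the `Hypothesis`/`Experiment` types, the `propose`/`design`/`run`/`score` callables (which would be backed by an interpreter model and an activation/ablation harness), and the stopping rule are hypothetical, not the implementation the post describes.

```python
"""Illustrative sketch of an interpreter-agent loop for a single SAE feature."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    description: str          # e.g. "fires on legal boilerplate"
    confidence: float = 0.0   # updated as experimental evidence accumulates


@dataclass
class Experiment:
    kind: str                 # "new_input" or "feature_ablation"
    payload: str              # prompt to test, or intervention spec
    result: str = ""          # filled in after running


def interpret_feature(
    feature_id: int,
    propose: Callable[[int, List[Hypothesis]], List[Hypothesis]],
    design: Callable[[List[Hypothesis]], List[Experiment]],
    run: Callable[[int, Experiment], str],
    score: Callable[[Hypothesis, List[Experiment]], float],
    max_rounds: int = 5,
    confidence_threshold: float = 0.9,
) -> Hypothesis:
    """Iteratively refine an explanation of one SAE feature.

    Each round: propose/refine hypotheses, design experiments that would
    distinguish between them, run those experiments, then rescore the
    hypotheses against all evidence so far. Stops early once the best
    hypothesis clears the confidence threshold.
    """
    hypotheses: List[Hypothesis] = []
    evidence: List[Experiment] = []

    for _ in range(max_rounds):
        # Ask the interpreter model for new or refined hypotheses,
        # conditioned on everything observed so far.
        hypotheses = propose(feature_id, hypotheses)

        # Design experiments (new inputs, feature ablations) that would
        # discriminate between the current hypotheses.
        experiments = design(hypotheses)

        # Run each experiment against the subject model / SAE harness.
        for exp in experiments:
            exp.result = run(feature_id, exp)
        evidence.extend(experiments)

        # Rescore every hypothesis against the accumulated evidence.
        for hyp in hypotheses:
            hyp.confidence = score(hyp, evidence)

        best = max(hypotheses, key=lambda h: h.confidence)
        if best.confidence >= confidence_threshold:
            return best  # treat the feature as well understood

    return max(hypotheses, key=lambda h: h.confidence)
```

The knobs here (`max_rounds`, how many experiments get designed per round) are exactly where extra inference-time compute would go, which is what makes the inference-scaling question above interesting.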