In future work, one could imagine automating the evaluation of the coherence and generalization of learned steering vectors, similar to how Bills et al. (2023) automate the interpretation of neurons in language models. For example, one could prompt a trusted model to produce queries that explore the limits and consistency of the behaviors captured by unsupervised steering vectors.
It would probably be even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do iterative hypothesis testing.
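To make the first idea concrete, here is a minimal sketch of what such an evaluation loop might look like. It is not an implementation of any existing system: `query_model`, `steered_model`, and `judge_model` are hypothetical placeholders for a trusted query-generating model, the model with a steering vector applied, and a trusted grader, respectively.

```python
from typing import Callable, List

def evaluate_steering_vector(
    behavior_description: str,
    query_model: Callable[[str], str],    # trusted model: writes probe queries
    steered_model: Callable[[str], str],  # model with the steering vector applied
    judge_model: Callable[[str], str],    # trusted model: grades each response
    n_queries: int = 20,
) -> float:
    """Estimate how consistently the steered model exhibits the behavior."""
    # 1. Ask the trusted model for diverse queries probing the behavior's limits.
    prompt = (
        f"Write {n_queries} diverse prompts that test whether a model "
        f"consistently exhibits this behavior: {behavior_description}. "
        "Output one prompt per line."
    )
    queries: List[str] = [
        q.strip() for q in query_model(prompt).splitlines() if q.strip()
    ]

    # 2. Run each query through the steered model and have the judge grade it.
    scores = []
    for query in queries:
        response = steered_model(query)
        verdict = judge_model(
            f"Behavior: {behavior_description}\n"
            f"Prompt: {query}\n"
            f"Response: {response}\n"
            "Does the response exhibit the behavior? Answer YES or NO."
        )
        scores.append(1.0 if "YES" in verdict.upper() else 0.0)

    # 3. Consistency = fraction of probe queries on which the behavior held up.
    return sum(scores) / max(len(scores), 1)
```

An interpretability agent could subsume this loop by choosing the next probe query adaptively, based on the verdicts so far, rather than generating all queries up front.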