Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop automated interpretability techniques and then apply them to a set of model interventions (e.g. L1 regularization) to see which interventions actually improve model interpretability.
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Project: Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to measure how each regularization technique affects the model's interpretability.
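To make the comparison concrete, here's a rough sketch of the experiment loop I have in mind (all the helper names below are hypothetical placeholders, not an existing API):

```python
# Hypothetical experiment skeleton: train otherwise-identical models under
# different regularizers, then score each on performance and on an automated
# interpretability metric (e.g. a MAIA-style evaluation).
regularizers = ["none", "l1_weights", "l1_activations", "weight_pruning"]

results = {}
for reg in regularizers:
    model = train_model(regularizer=reg)                # placeholder training routine
    perf = evaluate_perplexity(model)                   # performance cost of the intervention
    interp = automated_interpretability_score(model)    # e.g. MAIA-style probing
    results[reg] = {"perplexity": perf, "interp_score": interp}
```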
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we may still be able to make these models more interpretable through regularization. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations, and sparse models tend to be more interpretable since they rely on a smaller set of important features.
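As a minimal sketch of what I mean (PyTorch-style; the model interface and the L1 coefficient are assumptions, not a worked-out setup):

```python
import torch
import torch.nn.functional as F

def loss_with_l1(model, batch, l1_coeff=1e-4):
    # Assumes `model` returns (logits, hidden_states) and `batch` contains
    # `input_ids` / `labels`; these names are placeholders.
    logits, hidden_states = model(batch["input_ids"])
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    # L1 penalty on activations pushes the model toward sparser internal representations.
    l1 = sum(h.abs().mean() for h in hidden_states)
    return ce + l1_coeff * l1
```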
Whether this works or not, I’d be interested in making more progress on automated interpretability, along the lines of what you’re proposing.
Hey Jacques, sure, I’d be happy to chat!