Hi David, co-author of the ‘Sparse Autoencoders Find Highly Interpretable Directions in Language Models’ paper here,
I think this might be of interest to you:
We are currently reframing section 4 of the paper to focus more on model steering and activation editing. In line with what you hypothesise, we find that editing a small number of relevant features, e.g. on the IOI task, can steer the model from predicting one token to predicting a counterfactual token.
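For concreteness, here is a minimal sketch of the general recipe (re-encode an activation with the SAE, overwrite a few feature coefficients, decode back). It uses the transformer_lens library; the layer, feature indices, edit magnitude, and the toy untrained SAE are illustrative placeholders, not the exact setup from the paper:

```python
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

class SparseAutoencoder(nn.Module):
    """Toy SAE for illustration; in practice the weights come from
    training on the model's activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

model = HookedTransformer.from_pretrained("gpt2")
sae = SparseAutoencoder(d_model=model.cfg.d_model, n_features=4096)
hook_name = "blocks.8.hook_resid_post"  # placeholder layer choice

def edit_features(resid, hook):
    # Re-encode the residual stream into SAE features, overwrite a
    # handful of them, and decode back into the residual stream.
    feats = sae.encode(resid)
    feats[..., [123, 456]] = 5.0  # placeholder feature ids / magnitude
    return sae.decode(feats)

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, edit_features)])
# Compare the final-position logits on " Mary" vs " John" to check
# whether the edit moved the prediction toward the counterfactual token.
```

With a trained SAE, the features to overwrite would be the ones the dictionary assigns to the relevant name information, rather than arbitrary indices as above.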