Hi David, co-author of the ‘Sparse Autoencoders Find Highly Interpretable Directions in Language Models’ paper here,
I think this might be of interest to you:
We are currently reframing section 4 of the paper to focus more on model steering and activation editing. In line with what you hypothesise, we find that editing a small number of relevant features, e.g. on the IOI task, can steer the model from predicting one token to predicting a counterfactual token.
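For concreteness, here is a minimal sketch of the general recipe (re-encode an activation with the SAE, overwrite a few feature coefficients, decode back). It uses the transformer_lens library; the layer, feature indices, edit magnitude, and the toy untrained SAE are illustrative placeholders, not the exact setup from the paper:

```python
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

class SparseAutoencoder(nn.Module):
    """Toy SAE for illustration; in practice the weights come from
    training on the model's activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

model = HookedTransformer.from_pretrained("gpt2")
sae = SparseAutoencoder(d_model=model.cfg.d_model, n_features=4096)
hook_name = "blocks.8.hook_resid_post"  # placeholder layer choice

def edit_features(resid, hook):
    # Re-encode the residual stream into SAE features, overwrite a
    # handful of them, and decode back into the residual stream.
    feats = sae.encode(resid)
    feats[..., [123, 456]] = 5.0  # placeholder feature ids / magnitude
    return sae.decode(feats)

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, edit_features)])
# Compare the final-position logits on " Mary" vs " John" to check
# whether the edit moved the prediction toward the counterfactual token.
```

With a trained SAE, the features to overwrite would be the ones the dictionary assigns to the relevant name information, rather than arbitrary indices as above.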