One experiment here is to check whether the datapoints with worse CE-diff are the same ones across layers. Last time I ran a similar experiment (for just one layer of gpt2-small), I saw a very long tail of datapoints that were worse off, while the majority of datapoints had similar CE. Joseph previously suggested running UMAP on these datapoints and coloring them by CE-diff (or using other methods to see whether these datapoints can be separated out).
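A rough sketch of the cross-layer check, assuming you already have a per-datapoint CE-diff array of shape `(n_layers, n_datapoints)`. The function names and the toy numbers are made up for illustration, and the rank correlation here ignores ties:

```python
import numpy as np

def rank_rows(x):
    # Ordinal ranks within each row (ties broken arbitrarily, not averaged).
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def layer_rank_correlation(ce_diff):
    """Spearman-style rank correlation of per-datapoint CE-diff between layers."""
    ce_diff = np.asarray(ce_diff, dtype=float)
    return np.corrcoef(rank_rows(ce_diff))

def tail_overlap(ce_diff, frac=0.05):
    """Jaccard overlap between layers of the worst `frac` of datapoints."""
    ce_diff = np.asarray(ce_diff, dtype=float)
    n_layers, n_points = ce_diff.shape
    n_tail = max(1, int(round(frac * n_points)))
    tails = [set(np.argsort(row)[-n_tail:].tolist()) for row in ce_diff]
    out = np.zeros((n_layers, n_layers))
    for i in range(n_layers):
        for j in range(n_layers):
            out[i, j] = len(tails[i] & tails[j]) / len(tails[i] | tails[j])
    return out

# Toy placeholder values, NOT real measurements: 3 layers x 4 datapoints.
toy_ce_diff = np.array([
    [0.1, 0.2, 5.0, 0.3],
    [0.0, 0.1, 4.0, 0.2],
    [5.0, 0.3, 0.2, 0.1],
])
corr = layer_rank_correlation(toy_ce_diff)
overlap = tail_overlap(toy_ce_diff, frac=0.25)
```

If the long tail is made up of the same datapoints in every layer, both matrices should be close to 1 off the diagonal. If each layer has its own bad datapoints, the tail overlap should be near 0.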
If someone were to run this experiment, I’d also be interested in what happens when you remove the k lowest-activating features per datapoint and check the new CE and MSE. In SAE work, the lowest-activating features usually don’t make sense for the datapoint. This would test two competing hypotheses:
1. Low-activating features are noise, or some acceptable false-alarm rate that is true to the LLM. (i.e. SAEs capture what we care about)
2. They’re actually important for CE in ways we don’t understand. (i.e. SAEs let in uninterpretable feature activations that are nonetheless important)
For example, if you saw better CE-diff when removing low-activating features, up to a specific k, then SAEs are looking good!
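Here is a minimal sketch of the ablation step, assuming non-negative SAE activations `acts` of shape `(n_datapoints, d_sae)` and a linear decoder `acts @ W_dec + b_dec`. All names are hypothetical. It only covers MSE, since measuring the new CE would mean patching the reconstruction back into the model:

```python
import numpy as np

def drop_k_lowest_active(acts, k):
    """Zero out the k smallest *active* (nonzero) features in each row."""
    acts = np.asarray(acts, dtype=float)
    # Inactive features are mapped to +inf so they sort last and are never
    # picked ahead of an active feature.
    masked = np.where(acts > 0, acts, np.inf)
    lowest = np.argsort(masked, axis=1)[:, :k]
    out = acts.copy()
    rows = np.arange(acts.shape[0])[:, None]
    out[rows, lowest] = 0.0  # rows with fewer than k active features just end up all zero
    return out

def recon_mse(x, acts, W_dec, b_dec):
    """Per-datapoint MSE between the original activations and the SAE reconstruction."""
    recon = acts @ W_dec + b_dec
    return ((recon - x) ** 2).mean(axis=1)

# Toy example with an identity decoder, purely to show the shapes.
toy_acts = np.array([[0.0, 0.5, 2.0, 0.1]])
toy_W_dec = np.eye(4)
toy_b_dec = np.zeros(4)
toy_x = toy_acts @ toy_W_dec + toy_b_dec
mse_by_k = [
    recon_mse(toy_x, drop_k_lowest_active(toy_acts, k), toy_W_dec, toy_b_dec)[0]
    for k in range(4)
]
```

Sweeping k and plotting CE-diff next to `mse_by_k` would show whether there is a range of k where MSE gets worse but CE-diff improves, which is the case the first hypothesis predicts.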
Neel and I recently tried to interpret a language model circuit by attaching SAEs to the model. We found that using an L0=50 SAE while keeping only the top 10 features by activation value per prompt (and zero-ablating the others) was better than an L0=10 SAE, both by our task-specific metric and by subjective interpretability. I can check how far this generalizes.
I’d be excited to read about or run these kinds of experiments. My weak prediction is that low-activating features are important in specific examples where nuance matters, and that what we want is something like an “adversarially robust SAE”, which might only be feasible with current SAE methods on a very narrow distribution.
A mini experiment that motivates this: with an SAE at the residual stream, I looked at the attention pattern of an attention head immediately following the SAE as a function of k, where we keep only the top-k SAE features in the reconstruction. I found that if the head was attending to “Mary” in the original forward pass (and not “John”), then a k of 3 was enough to have it attend to Mary and not John. But if I replaced John with Martha, the minimum k at which the head attended to Mary increased.
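The "minimum k" search can be sketched roughly like this, for a single position's SAE activations. The `predicate` is a stand-in for whatever check you care about (e.g. "the head attends more to Mary than to John when run on this reconstruction"); it and every other name here are assumptions, not the code used in the experiment above:

```python
import numpy as np

def keep_top_k(acts, k):
    """Keep the k largest activations along the last axis and zero the rest."""
    acts = np.asarray(acts, dtype=float)
    if k <= 0:
        return np.zeros_like(acts)
    k = min(k, acts.shape[-1])
    idx = np.argpartition(-acts, k - 1, axis=-1)[..., :k]
    out = np.zeros_like(acts)
    np.put_along_axis(out, idx, np.take_along_axis(acts, idx, axis=-1), axis=-1)
    return out

def min_k_satisfying(acts, W_dec, b_dec, predicate, max_k=None):
    """Smallest k such that predicate(top-k reconstruction) holds, else None."""
    acts = np.asarray(acts, dtype=float)
    n_active = int((acts > 0).sum())
    max_k = n_active if max_k is None else min(max_k, n_active)
    for k in range(1, max_k + 1):
        recon = keep_top_k(acts, k) @ W_dec + b_dec
        if predicate(recon):
            return k
    return None

# Toy stand-in: the "behavior" appears once direction 2 is present in the reconstruction.
toy_acts = np.array([3.0, 1.0, 2.0])
toy_W_dec = np.eye(3)
toy_b_dec = np.zeros(3)
toy_min_k = min_k_satisfying(toy_acts, toy_W_dec, toy_b_dec, lambda r: r[2] > 0)
```

Comparing this minimum k between the original prompt and a perturbed one (John replaced with Martha) gives a per-example measure of how much the behavior depends on lower-activating features.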