We haven’t tried this yet. Thanks, that’s a good hypothesis.
I suspect that the mean-centering paper (https://arxiv.org/abs/2312.03813) is just cancelling the high-frequency features. If so, that would be a good explanation for why taking differences is important in activation steering.
(Though it doesn’t explain why the SAEs learn several high-frequency features when trained on the residual stream.)
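To make the cancellation intuition concrete, here is a toy NumPy sketch. It is not taken from the paper or from any real model: the `shared` direction (standing in for an always-on, high-frequency feature), the `concept` direction, the scales, and the noise level are all made-up assumptions. It also assumes that mean-centering amounts to subtracting a dataset-wide mean activation, with the pooled positive and negative samples standing in for that dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy residual-stream width (assumption)


def unit(v):
    return v / np.linalg.norm(v)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical directions: "shared" is present on every input,
# "concept" is the direction we would like to steer along.
shared = unit(rng.normal(size=d))
concept = rng.normal(size=d)
concept = unit(concept - (concept @ shared) * shared)  # orthogonal to shared, for clarity


def sample(n, has_concept):
    # Every activation carries a large shared component plus small noise;
    # only concept-bearing inputs carry the concept direction.
    noise = 0.1 * rng.normal(size=(n, d))
    return 10.0 * shared + (2.0 * concept if has_concept else 0.0) + noise


pos = sample(500, True)
neg = sample(500, False)

# Three candidate steering vectors.
raw_mean = pos.mean(axis=0)                                   # no subtraction
diff_of_means = pos.mean(axis=0) - neg.mean(axis=0)           # contrastive difference
mean_centered = pos.mean(axis=0) - np.concatenate([pos, neg]).mean(axis=0)

for name, vec in [("raw mean", raw_mean),
                  ("difference of means", diff_of_means),
                  ("mean-centered", mean_centered)]:
    print(f"{name:>20}: cos(concept)={cosine(vec, concept):.3f}, "
          f"cos(shared)={cosine(vec, shared):.3f}")
```

In this toy setup the raw mean is dominated by the shared component, while both subtractions remove it and leave a vector that points almost entirely along the concept direction. Whether real high-frequency SAE features behave like this constant offset is exactly the open question.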