Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
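For concreteness, here's roughly what that suggestion looks like in code. This is a hedged sketch, not a tested recipe: it assumes a TransformerLens `HookedTransformer`, and the model name, layer, and prompts are placeholders.

```python
# Sketch: PCA over residual-stream activations (assumes TransformerLens + scikit-learn).
import torch
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
layer = 6                                          # placeholder layer
prompts = ["The weather today is", "I think that", "In the beginning"]  # placeholder batch

# Collect residual-stream activations at the chosen layer over the batch.
acts = []
for p in prompts:
    _, cache = model.run_with_cache(model.to_tokens(p))
    # [n_tokens, d_model]: residual stream going into block `layer`
    acts.append(cache[utils.get_act_name("resid_pre", layer)][0])
acts = torch.cat(acts, dim=0).detach().cpu().numpy()

# Principal directions of the activation distribution; each row is a candidate
# activation addition direction.
pca = PCA(n_components=10)
pca.fit(acts)
directions = torch.tensor(pca.components_, dtype=torch.float32)  # [10, d_model]
print(pca.explained_variance_ratio_)
```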
It’s not PCA, but we’ve been using sparse coding to find important directions in activation space (see the original sparse coding post, quantitative results, and qualitative results).
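To give a rough picture of the setup (this isn’t our exact training code; the dimensions, L1 coefficient, and data below are placeholders), the idea is to train a single-layer sparse autoencoder on residual-stream activations and read the learned dictionary directions off the decoder:

```python
# Sketch: a tiny sparse autoencoder, i.e. dictionary learning with an L1 penalty.
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 512, 2048, 1e-3  # placeholder sizes and penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(codes)          # reconstruction of the input
        return recon, codes

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# `acts` would be residual-stream activations, e.g. collected as in the PCA sketch above.
acts = torch.randn(4096, d_model)  # placeholder data
for batch in acts.split(256):
    recon, codes = sae(batch)
    # Reconstruction error plus an L1 sparsity penalty on the codes.
    loss = (recon - batch).pow(2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row here is a candidate "important direction" in activation space.
# (Real setups typically also constrain the decoder columns to unit norm.)
directions = sae.decoder.weight.T  # [d_dict, d_model]
```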
We’ve found that they’re on average more interpretable than neurons, and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a “this direction activates on curse words” direction to make text more aggressive. If people are interested in exploring this further, let me know: say hi in our EleutherAI channel or check out the repo :)
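For anyone who wants to poke at this, here’s a rough sketch of what that kind of steering looks like, again assuming TransformerLens; the model, layer, coefficient, and direction below are placeholders, not the actual curse-word feature:

```python
# Sketch: adding a chosen direction to the residual stream during generation.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
layer, coeff = 6, 8.0                              # placeholder layer and scale

# `direction` would be a PCA component or a sparse-coding dictionary row.
direction = torch.randn(model.cfg.d_model)         # placeholder direction
direction = direction / direction.norm()

def add_direction(resid, hook):
    # resid: [batch, seq, d_model]; add the scaled direction at every position.
    return resid + coeff * direction.to(resid.device)

hook_name = utils.get_act_name("resid_pre", layer)
with model.hooks(fwd_hooks=[(hook_name, add_direction)]):
    print(model.generate("I went to the shop and", max_new_tokens=30))
```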