The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron. Thus activation patching doesn’t directly tell you where to find the representation of a concept.
I’m pretty sure both methods give you an approximate location of the representation. RE is typically run on many layers, and you then pick the layer that works best. Activation patching ablates each layer in turn and shows you which one matters most for the counterfactual. I would trust patching over RE for locating a representation (maybe grounded out as what an SAE would find?), since the patched activations are more in-distribution.
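To make the contrast concrete, here is a rough sketch on a toy NumPy network, not anyone's actual setup. Everything in it is an assumption for illustration: the 4-layer residual MLP, the `concept` direction, the `run` helper, and the use of a mean activation difference as the RE direction (one common choice among several). Patching overwrites part of an activation with the value from a counterfactual run, while the RE-style intervention only adds a vector to it.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, WIDTH = 4, 8

# Toy residual MLP standing in for a real model (purely illustrative).
weights = [rng.normal(scale=0.5, size=(WIDTH, WIDTH)) for _ in range(N_LAYERS)]
readout = rng.normal(size=WIDTH)
concept = rng.normal(size=WIDTH)  # hypothetical "concept" direction in input space


def run(x, patch=None, add=None):
    """Forward pass returning (scalar output, list of per-layer activations).

    patch=(layer, idx, source) overwrites coordinates idx of that layer's
    activation with source[idx] (activation patching).
    add=(layer, vector) adds a vector to that layer's activation (RE-style).
    """
    h, acts = x, []
    for i, w in enumerate(weights):
        h = h + np.tanh(w @ h)
        if patch is not None and patch[0] == i:
            _, idx, source = patch
            h = h.copy()
            h[idx] = source[idx]
        if add is not None and add[0] == i:
            h = h + add[1]
        acts.append(h)
    return float(readout @ h), acts


# A counterfactual pair: same base input, with and without the concept.
base = rng.normal(size=WIDTH)
pos_out, pos_acts = run(base + concept)
neg_out, neg_acts = run(base - concept)

# Activation patching: copy one neuron at a time from the "pos" run into the
# "neg" run and record how far the output moves.
patch_effect = np.array([
    [run(base - concept, patch=(layer, n, pos_acts[layer]))[0] - neg_out
     for n in range(WIDTH)]
    for layer in range(N_LAYERS)
])


def layer_acts(x):
    """All layer activations for input x, stacked to shape (N_LAYERS, WIDTH)."""
    return np.stack(run(x)[1])


# RE-style: estimate one direction per layer as the mean activation difference
# over separate contrastive pairs, then add it to the "neg" run.
train = rng.normal(size=(32, WIDTH))
directions = np.mean(
    [layer_acts(x + concept) - layer_acts(x - concept) for x in train], axis=0
)
steer_effect = np.array([
    run(base - concept, add=(layer, directions[layer]))[0] - neg_out
    for layer in range(N_LAYERS)
])

print("full counterfactual effect:", pos_out - neg_out)
print("largest single-neuron patching effect per layer:", np.abs(patch_effect).max(axis=1))
print("steering effect per layer:", steer_effect)
```

Both loops give a per-layer score you could use to pick a location. The difference is that the patched values come from a real forward pass, whereas the added vector can push the activation somewhere the model never visits on its own.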
On activation patching:
Thanks, I agree that activation patching can also be used to localize representations (and I have corrected the mistake in the post).