Logan Riggs comments on An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Logan Riggs 14 Jul 2024 19:39 UTC
3 points
1
On activation patching:
The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron. Thus activation patching doesn’t directly tell you where to find the representation of a concept.
I’m pretty sure both methods give you an approximate location of the representation. RE is typically run on many layers, and then the best-performing layer is picked. Activation patching ablates each layer in turn & shows you which one is most important for the counterfactual. I would trust the result of patching more than RE for the location of a representation (maybe grounded out as what an SAE would find?), because patched activations are more in-distribution.
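To make the contrast concrete, here is a minimal sketch on a toy NumPy residual network, not a real LLM. Everything in it (`run`, `steer`, `patch`, the random weights, the "clean" and "corrupt" inputs) is a hypothetical stand-in chosen for illustration. It only shows the mechanical difference: RE-style steering adds a vector to a layer's output, while patching replaces that output with the one recorded on another run, and both can be swept over layers.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 4, 8
# Hypothetical toy weights; each layer adds tanh(W @ h) to the residual stream.
WEIGHTS = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]


def run(x, layer=None, edit=None):
    """Forward pass; optionally edit the output of one layer before it is added."""
    h = x.copy()
    outs = []
    for i, w in enumerate(WEIGHTS):
        out = np.tanh(w @ h)
        if i == layer:
            out = edit(out)
        outs.append(out)
        h = h + out
    return h, outs


def steer(x, layer, direction, alpha):
    """RE-style intervention: keep the layer output and ADD a scaled direction."""
    return run(x, layer, lambda out: out + alpha * direction)[0]


def patch(x, layer, source_outs):
    """Activation patching: REPLACE the layer output with one from another run."""
    return run(x, layer, lambda out: source_outs[layer])[0]


x_clean, x_corrupt = rng.normal(size=D), rng.normal(size=D)
clean_final, clean_outs = run(x_clean)
corrupt_final, _ = run(x_corrupt)

# Patching sweep: for each layer, how far does patching in the clean
# output move the corrupted run's final state toward the clean one?
patch_dist = [
    float(np.linalg.norm(patch(x_corrupt, l, clean_outs) - clean_final))
    for l in range(N_LAYERS)
]

# Steering sweep: for each layer, how far does adding the same (arbitrary)
# direction move the final state away from the unsteered run?
direction = rng.normal(size=D)
steer_shift = [
    float(np.linalg.norm(steer(x_corrupt, l, direction, 2.0) - corrupt_final))
    for l in range(N_LAYERS)
]
```

In both sweeps the per-layer numbers are what you would compare to pick a layer. The difference is that the patched value was produced by an actual forward pass, whereas the steered value is a sum the network may never produce on its own.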
- Jan Wehner 15 Jul 2024 13:25 UTC
  1 point
  0
  Thanks, I agree that Activation Patching can also be used for localizing representations (and I edited the mistake in the post).