Charlie Steiner comments on graphpatch: a Python Library for Activation Patching

Charlie Steiner 7 Jun 2024 12:07 UTC
3 points
This seems cool. Could you explain a little more why it’s a pain to do ROME with vanilla hooks? I would have expected that, although it would be maybe 2x messier, it wouldn’t require creating a custom model.
- Occam's Laser 7 Jun 2024 22:22 UTC
  3 points
  Thanks! You’re correct that you can implement ROME with vanilla hooks, since these give you access to module inputs in addition to the outputs. But the fact that this works is contingent on both the specific interventions ROME makes and the way Llama/GPT2 happen to be implemented. To get maybe overly concrete, in this line
```
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```
  ROME wants the result of the multiplication, which isn’t the output of any individual submodule. You happen to be able to access it as the input of down_proj, because down_proj happens to be a module, but it didn’t have to be implemented this way. (This would be even worse if we wanted to patch the value instead of just observing it, since we’d have to patch every consumer, and those would all also have to be modules or we’d be SOL.) It’s easy to imagine ROME-adjacent experiments that you simply can’t do with module hooks alone, which bothered me. The TransformerLens answer to this is to wrap everything in a submodule (HookPoint), which works well enough for the models that have already been converted, but struck me as a sufficiently “wrong” approach (hard to maintain, requires upfront work for every new model) that I wrote a library about it :)
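  For anyone who wants to see the vanilla-hook route concretely, here is a minimal sketch in plain PyTorch. `ToyMLP` is a hypothetical stand-in I made up for illustration (it only copies the shape of the line quoted above, not the real Llama code), and the sizes are arbitrary. It reads the product through a forward pre-hook on `down_proj`, then patches it the same way, which only works because `down_proj` is the sole consumer and happens to be a module.

```python
import torch
import torch.nn as nn


class ToyMLP(nn.Module):
    """Hypothetical stand-in for a Llama-style MLP block (illustration only)."""

    def __init__(self, hidden=8, inner=16):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inner, bias=False)
        self.up_proj = nn.Linear(hidden, inner, bias=False)
        self.down_proj = nn.Linear(inner, hidden, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))


torch.manual_seed(0)
mlp = ToyMLP()
x = torch.randn(2, 8)
captured = {}


def grab_input(module, args):
    # args is the tuple of positional inputs to down_proj;
    # args[0] is the result of the multiplication.
    captured["product"] = args[0].detach().clone()


# Observe: the product is only reachable as the *input* of down_proj.
handle = mlp.down_proj.register_forward_pre_hook(grab_input)
with torch.no_grad():
    mlp(x)
    expected = mlp.act_fn(mlp.gate_proj(x)) * mlp.up_proj(x)
handle.remove()


def zero_input(module, args):
    # Returning a tuple from a forward pre-hook replaces the module's inputs.
    return (torch.zeros_like(args[0]),)


# Patch: works here only because down_proj is the single consumer of the value.
handle = mlp.down_proj.register_forward_pre_hook(zero_input)
with torch.no_grad():
    patched = mlp(x)
handle.remove()
```

  If the product were consumed by a plain function call or a second expression rather than a module, there would be nothing to attach either hook to, which is the gap described above.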