Xander Davies comments on Attribution Patching: Activation Patching At Industrial Scale

Xander Davies 19 Mar 2023 20:48 UTC
LW: 3 AF: 2
Very cool work!
- In the attention attribution section, you use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? The first-order approximation of zero ablation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
  - Currently, negative name movers end up with negative attributions, but we’d like them to be positive (since zero ablating helps performance and moves our metric towards one), right?
  - Of course, this doesn’t matter when you are just looking at magnitudes.
- Cool to note we can approximate mean ablation with `(means - clean_act) * clean_grad_act`!
- (Minor note: I think the notebook is missing a `model.set_use_split_qkv_input(True)`. I also had to remove `from transformer_lens.torchtyping_helper import T`.)
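
To make the sign conventions in the bullets above concrete, here is a minimal sketch on a toy linear metric. It is not the notebook's code: `metric`, `weights`, `mean_act`, and all the numbers are made up for illustration. Because the toy metric is linear, the first-order estimates happen to be exact here, which would generally not hold for a real model.

```python
# Toy illustration only: `metric`, `weights`, and the activation values are
# hypothetical stand-ins, not taken from the attribution patching notebook.

weights = [2.0, -1.0, 0.5]

def metric(act):
    """Linear toy metric, so its gradient w.r.t. `act` is just `weights`."""
    return sum(w * a for w, a in zip(weights, act))

clean_act = [1.0, 3.0, -2.0]
mean_act = [0.5, 1.0, 0.0]       # pretend dataset mean of the activation
clean_grad = list(weights)       # d metric / d act, evaluated on the clean run

# "Importance" convention: clean_act * clean_grad (positive = helps the metric).
importance = [a * g for a, g in zip(clean_act, clean_grad)]

# "Estimated effect of zero ablation": (0 - clean_act) * clean_grad.
zero_abl_est = [(0.0 - a) * g for a, g in zip(clean_act, clean_grad)]

# "Estimated effect of mean ablation": (mean_act - clean_act) * clean_grad.
mean_abl_est = [(m - a) * g for m, a, g in zip(mean_act, clean_act, clean_grad)]

# Compare the summed estimates with actually ablating the whole vector.
true_zero_effect = metric([0.0] * len(clean_act)) - metric(clean_act)
true_mean_effect = metric(mean_act) - metric(clean_act)

print(importance, zero_abl_est, mean_abl_est)
print(sum(zero_abl_est), true_zero_effect)
print(sum(mean_abl_est), true_mean_effect)
```

In this toy setup, index 1 plays the role of a negative name mover: its importance under the `clean_act * clean_grad` convention is negative, while its estimated zero-ablation effect is positive. The two conventions differ only by sign.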
- Neel Nanda 19 Mar 2023 22:25 UTC
  LW: 3 AF: 2
  Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me, “positive = is important” and “negative = damaging” is the intuitive way round, which is why I set it up the way I did.
  
  And yeah, I would be excited to see this applied to mean ablation!
  
  Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...
  - Xander Davies 20 Mar 2023 3:14 UTC
    LW: 1 AF: 1
    Makes sense! Depends on if you’re thinking about the values as “estimating zero ablation” or “estimating importance.”
- Neel Nanda 20 Mar 2023 21:52 UTC
  LW: 2 AF: 1
  These bugs should be fixed, thanks for flagging!