Very cool work!
In the attention attribution section, you use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation’s approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
Currently, negative name movers end up with negative attributions, but we’d like them to be positive (since zero ablating helps performance and moves our metric towards one), right?
Of course, this doesn’t matter when you are just looking at magnitudes.
Cool to note we can approximate mean ablation with `(means - clean_act) * clean_grad_act`!
(Minor note: I think the notebook is missing a `model.set_use_split_qkv_input(True)`. I also had to remove `from transformer_lens.torchtyping_helper import T`.)
Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me “positive = is important” and “negative = damaging” is the intuitive way round, which is why I set it up the way I did.
And yeah, I would be excited to see this applied to mean ablation!
Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...
Makes sense! Depends on if you’re thinking about the values as “estimating zero ablation” or “estimating importance.”
These bugs should be fixed, thanks for flagging!
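
To make the two sign conventions discussed above concrete, here is a minimal pure-Python sketch. It is not the notebook's code: the toy linear metric, `WEIGHTS`, the activation values, and the helper names are all hypothetical. Because the toy metric is linear, the first-order estimate `(baseline - clean_act) * clean_grad` matches the true ablation effect exactly; for a real model it would only be an approximation.

```python
# Hypothetical toy setup: three "components" feeding a linear metric.
# Index 1 has a negative weight, so it plays the role of a component
# that hurts the metric (loosely analogous to a negative name mover).
WEIGHTS = [2.0, -1.0, 0.5]


def metric(acts):
    """Toy linear metric over per-component activations."""
    return sum(w * a for w, a in zip(WEIGHTS, acts))


def metric_grad(acts):
    """Gradient of the toy metric w.r.t. each activation (constant here)."""
    return list(WEIGHTS)


def ablation_estimate(clean, grads, baseline):
    """First-order estimate of the metric change when each component is
    replaced by its baseline value: (baseline - clean) * grad."""
    return [(b - c) * g for c, g, b in zip(clean, grads, baseline)]


def true_ablation_effect(clean, baseline, i):
    """Actual metric change when only component i is set to its baseline."""
    patched = list(clean)
    patched[i] = baseline[i]
    return metric(patched) - metric(clean)


clean_act = [1.0, 3.0, 2.0]
clean_grad = metric_grad(clean_act)
zeros = [0.0, 0.0, 0.0]
means = [0.5, 1.0, 2.0]  # made-up per-component means

# "Estimating zero ablation": (0 - clean) * grad = -clean * grad
zero_ablation_est = ablation_estimate(clean_act, clean_grad, zeros)

# "Estimating importance": clean * grad, the same numbers with flipped sign
importance = [c * g for c, g in zip(clean_act, clean_grad)]

# Mean ablation: (means - clean) * grad
mean_ablation_est = ablation_estimate(clean_act, clean_grad, means)

print("zero ablation estimate:", zero_ablation_est)
print("importance convention: ", importance)
print("mean ablation estimate:", mean_ablation_est)
```

Under the "estimating zero ablation" reading, the harmful component at index 1 gets a positive value (ablating it raises the metric); under the "estimating importance" reading it gets a negative one. The magnitudes are identical either way.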