Thus, we report the effect of any particular patching experiment as a percentage, (patched_log_prob - log_prob_nb)/(log_prob_b - log_prob_nb), with 0 meaning the patch had no effect, and 100% meaning that the patch is responsible for the whole effect.
100% wouldn’t necessarily mean “this is the only relevant activation” though, right? If you have a complete set of disjoint patches, their effects under this metric would sum to 100%, but some could be negative, in which case the ones with positive effect would sum to more than 100%.
I’m not sure how likely that would be, but my naive guess is you probably wouldn’t see both large positive and large negative effects.
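To make the concern concrete, here is a minimal sketch with made-up numbers. The patch names and log-prob values are hypothetical, and it takes as given the premise above that a complete set of disjoint patches sums to 100%, which presumably only holds if the effects combine additively.

```python
def patch_effect(patched_log_prob, log_prob_b, log_prob_nb):
    """Fraction of the (log_prob_b - log_prob_nb) gap recovered by one patch."""
    return (patched_log_prob - log_prob_nb) / (log_prob_b - log_prob_nb)

# Hypothetical baselines, not taken from the post.
log_prob_nb = -5.0
log_prob_b = -1.0

# Hypothetical patched log-probs for three disjoint patches.
patched = {"A": -0.5, "B": -6.0, "C": -4.5}

effects = {
    name: patch_effect(lp, log_prob_b, log_prob_nb)
    for name, lp in patched.items()
}

total = sum(effects.values())
positive_total = sum(e for e in effects.values() if e > 0)

for name, e in effects.items():
    print(f"{name}: {e:.1%}")
print(f"sum of all effects: {total:.1%}")
print(f"sum of positive effects: {positive_total:.1%}")
```

With these numbers, patch B has a negative effect, so the positive patches together account for more than 100% even though the total is exactly 100%. A single patch (A here) can also exceed 100% on its own.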