Thank you for this. How would you think about the pros/cons of influence functions vs. activation patching or direct logit attribution in terms of localizing a behavior in the model?