There are certain cases where pure gradient-based attributions predictably fail (most notably when a softmax is saturated).
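A minimal numpy sketch of that failure mode (my own illustration, not from any particular writeup): when one logit dominates, the softmax output is nearly one-hot, so the gradient of the log-probability with respect to the logits is nearly zero, and any gradient-based attribution built on it vanishes even though the inputs still matter.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_logprob_wrt_logits(z, target):
    # d/dz log softmax(z)[target] = onehot(target) - softmax(z)
    g = -softmax(z)
    g[target] += 1.0
    return g

# Unsaturated logits: the gradient is informative.
z_mild = np.array([2.0, 1.0, 0.0])
# Saturated logits: one class dominates, softmax is ~one-hot.
z_sat = np.array([20.0, 1.0, 0.0])

g_mild = grad_logprob_wrt_logits(z_mild, target=0)
g_sat = grad_logprob_wrt_logits(z_sat, target=0)

print(np.abs(g_mild).sum())  # noticeably nonzero
print(np.abs(g_sat).sum())   # near zero: attributions wash out
```

Nothing about the forward pass is broken in the saturated case; the model is just very confident, so locally perturbing any logit barely changes the output, and the gradient reflects that.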
Do you have a source or writeup on this somewhere? (Or would you mind explaining more / giving some examples where this is true?) Is this issue actually something that comes up for modern-day LLMs?
In my observations, it works fine for the toy tasks people have tried it on. The challenge seems to be interpreting the attributions, not issues with the attributions themselves.
Seems like this could be addressed by filtering comments that use evidence or personal examples out of your dataset.
If that’s too intense, filtering responses to remove personal examples and checking sources shouldn’t be too bad? But maybe you’d just end up with a model that tries to subvert the filter or draw misleading conclusions from sources instead of actually being helpful…