Thanks for this! One thing I don’t understand about influence functions is: why should I care about the proximal Bregman objective? To interpret a model, I’m really interested in the LOO retraining, right? Can we still say things like “it seems that the model relied on this training sample for producing this output” with the PBO interpretation?
I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering “retraining” from a point closer to the final model than random initialization. The downside is that if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.
Indeed, the paper concedes:
Influence functions are approximating the sensitivity to the training set locally around the final weights and might not capture nonlinear training phenomena
Purely empirically, I think Anthropic’s results indicate there are useful things that can be learnt, even via this local approximation:
One of the most consistent patterns we have observed is that the influential sequences reflect increasingly sophisticated patterns of generalization as the model scale increases. While the influential sequences for smaller models tend to have short overlapping sequences of tokens, the top sequences for larger models are related at a more abstract thematic level, and the influence patterns show increasing robustness to stylistic changes, including the language.
My intuition here is that even if we are not exactly measuring the counterfactual “what if this datum was not included in the training corpus?”, we could be estimating “what type of useful information is the model extracting from training data that looks like this?”.