I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering “retraining” from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.
Indeed, the paper concedes:
Influence functions are approximating the sensitivity to the training set locally around the final weights and might not capture nonlinear training phenomena
Purely empirically, I think Anthropic’s results indicate there are useful things that can be learnt, even via this local approximation:
One of the most consistent patterns we have observed is that the influential sequences reflect increasingly sophisticated patterns of generalization as the model scale increases. While the influential sequences for smaller models tend to have short overlapping sequences of tokens, the top sequences for larger models are related at a more abstract thematic level, and the influence patterns show increasing robustness to stylistic changes, including the language.
My intuition here is that even if we are not exactly measuring the counterfactual “what if this datum was not included in the training corpus?”, we could be estimating “what type of useful information is the model extracting from training data that looks like this?”.
I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering “retraining” from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.
Indeed, the paper concedes:
Purely empirically, I think Anthropic’s results indicate there are useful things that can be learnt, even via this local approximation:
My intuition here is that even if we are not exactly measuring the counterfactual “what if this datum was not included in the training corpus?”, we could be estimating “what type of useful information is the model extracting from training data that looks like this?”.