Re empirical evidence for influence functions:
Didn’t the Anthropic influence functions work pick up on LLMs not generalising across lexical ordering? E.g., training on “A is B” doesn’t raise the model’s credence in “Bs include A”?
Which is apparently true: https://x.com/owainevans_uk/status/1705285631520407821?s=46
That’s an exciting experimental confirmation! I’m looking forward to more predictions like those. (I’ll edit the post to add it, as well as future external validation results.)