(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.
I don’t confidently disagree with this statement, but I haven’t tried this myself or followed the research closely, and I have sometimes heard claims that there are promising methods.
A lot of people who try to answer this kind of question reach for mechanistic interpretability, but that probably isn’t very feasible. Investigations based on ideas like neural tangent kernels seem plausibly more satisfying and more feasible. For instance, if you show that the dataset contains a bunch of instances that would push the model towards saying “apple” rather than “banana”, then investigate where those data points come from and find that there’s a pretty logical story behind them, that seems basically like success.
As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn’t read the paper so I don’t know whether it’s legit, but that sort of thing seems quite plausibly feasible a lot of the time.
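To make the "which training instances push it towards apple rather than banana" idea concrete, here is a minimal sketch. It assumes a toy linear softmax classifier rather than an LLM, and it uses a first-order gradient-similarity score in the spirit of an empirical neural tangent kernel. This is not the method of the paper mentioned above, and every name in it (`attribution_scores`, the three-word vocabulary, the training points) is hypothetical.

```python
import numpy as np

VOCAB = ["apple", "banana", "cherry"]  # hypothetical toy vocabulary

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_grad(W, x, y):
    """Gradient of the cross-entropy loss w.r.t. W for one example (x, y)."""
    p = softmax(W @ x)
    p[y] -= 1.0
    return np.outer(p, x)

def margin_grad(W, x, a, b):
    """Gradient of (logit_a - logit_b) w.r.t. W at the query input x."""
    g = np.zeros_like(W)
    g[a] = x
    g[b] = -x
    return g

def attribution_scores(W, X_train, y_train, x_query, a, b):
    """First-order effect of one gradient step on each training example
    on the query's logit margin (class a over class b).
    Positive score: the example pushes the model towards a over b."""
    gq = margin_grad(W, x_query, a, b)
    return np.array([-(loss_grad(W, x, y) * gq).sum()
                     for x, y in zip(X_train, y_train)])

# Toy setup (assumed for illustration): untrained weights, three examples.
W = np.zeros((3, 2))
X_train = np.array([[1.0, 0.0],   # same input as the query, labelled "apple"
                    [1.0, 0.0],   # same input as the query, labelled "banana"
                    [0.0, 1.0]])  # unrelated input, labelled "cherry"
y_train = np.array([0, 1, 2])
x_query = np.array([1.0, 0.0])

scores = attribution_scores(W, X_train, y_train, x_query, a=0, b=1)
for s, y in zip(scores, y_train):
    print(f"label={VOCAB[y]:<6} score={s:+.3f}")
```

In this toy case the "apple"-labelled example gets a positive score, the "banana"-labelled one a negative score, and the unrelated one a score of zero. The interesting part of the story, tracing where the high-scoring data points came from, still has to be done by hand.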
Perhaps you’re thinking of the recent influence function work from Anthropic?
I don’t think that this paper either shows or claims that “LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions”. But they do find that there are influential training examples from sci-fi stories and AI safety discussion when asking the model questions about topics like this.