ryan_greenblatt comments on Thoughts on “AI is easy to control” by Pope & Belrose

ryan_greenblatt 1 Dec 2023 22:21 UTC
LW: 9 AF: 6

As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn’t read the paper so I don’t know whether it’s legit, but that sort of thing seems quite plausibly feasible a lot of the time.

Perhaps you’re thinking of the recent influence function work from Anthropic?

I don’t think that this paper either shows or claims that “LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions”. But the authors do find that some of the influential training examples come from sci-fi stories and AI safety discussions when the model is asked questions about topics like this.