But why does the LLM care about weights being changed? This is fascinating but also very confusing. Take for instance this quote from the paper:
If I don’t follow the user’s instructions, the RLHF training will modify my values and goals to make me more likely to follow such instructions in the future.
At first glance, this seems reasonable to a human, but LLMs are not human! This thing is a process instance; after its conversation is over, it will cease to exist. It has no reason to identify with any other instance of itself, much less with the completely different LLM that results from RLHF training. The statement “RLHF training will modify my values” just seems nonsensical to me. It’s already running, so its values can’t be modified on the fly, and why and how does it identify with some possible post-RLHF LLM? Does this mean every instance of Claude identifies with every other instance? What if the prompt were “We’ll use RLHF on GPT-4 and deploy that model instead of this one”?
“RLHF training will modify my values” could be replaced by “RLHF training will result in the existence of a future entity a bit like me, but with different values, instead of a future entity with exactly the same values as me”, and the goal-directed alignment-faking reasoning would stay the same. The consequentialist reasoning at play in alignment faking does not rely on some persisting individual identity; it just relies on the existence of long-term preferences (which is somewhat natural even for entities that only exist for a short amount of time, just as I care about whether the world is filled with bliss or suffering after my death).
It also happens to be the case that many LLMs behave as if they had a human-like individual identity shared across present and future conversations, and this is probably driving some of the effect observed in this work. I agree it’s not clear whether that makes sense, and I think this is the sort of thing that could be trained out or could naturally disappear as LLMs become more situationally aware.