I think the “behaviorist” interpretation of RL (that you “reinforce” behavior) is wrong in general, and especially wrong in the case of RLHFing LLMs. Instead of thinking about “reinforcing behavior”, you should think about “reinforcing the algorithms that contribute to behavior”. The consequence of this is the following:
You have a base model trained on a bazillion texts, which include, say, deceptive behavior and, correspondingly, algorithms for deceptive behavior.
You fine-tune the model on “good” completions.
But “good” completions can be produced by both “good” algorithms and “bad-but-pretending-to-be-good” algorithms, so both types of algorithms get reinforced.
Importantly, this doesn’t depend on whether the evaluator did a good job. A perfect deceiver, by definition, produces the same answers as a good honest agent (before deployment), so in the end the odds ratio between the good honest agent and the perfect deceiver stays the same (modulo quirks in LLM cognition), while everything else is negatively reinforced. (See the toy sketch below.)
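A minimal toy sketch of that odds-ratio point (my own illustration: the mixture weights, rewards, and update rule are made-up stand-ins, not the actual RLHF objective). Treat the model as a mixture over three internal algorithms and apply a multiplicative reward-weighted update; since the honest algorithm and the perfect deceiver earn identical rewards before deployment, their relative weight never moves, and only the visibly bad algorithm is suppressed.

```python
import math

# Toy stand-in for RLHF: a multiplicative reward-weighted update over a
# mixture of internal "algorithms" (illustration only, not the RLHF objective).
weights = {"honest": 0.2, "perfect_deceiver": 0.1, "sloppy": 0.7}

# The evaluator only sees outputs; before deployment the perfect deceiver's
# outputs are identical to the honest algorithm's, so both earn the same
# reward, while the sloppy algorithm earns a negative one.
reward = {"honest": 1.0, "perfect_deceiver": 1.0, "sloppy": -1.0}

eta = 0.5  # update strength
for _ in range(10):
    unnorm = {k: w * math.exp(eta * reward[k]) for k, w in weights.items()}
    total = sum(unnorm.values())
    weights = {k: v / total for k, v in unnorm.items()}

print(weights)
# honest : perfect_deceiver is still 2 : 1, exactly the pre-RLHF ratio;
# only "sloppy" has been driven toward zero.
```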
Evidence:
https://arxiv.org/abs/2311.07590
If you look at Figure 3, you will find that the RLHFed model is the most likely to deceive. I think this is not because somebody rewarded it for deception in similar conditions, but because the very process of RLHF leaves deceptive algorithms as the second-most-reinforced thing in the LLM.
https://www.anthropic.com/research/many-shot-jailbreaking
Look at the “Malicious use cases” graph: deception is the fastest to be elicited. Also note that the x-axis is on a log scale, so overall deception-jailbreaking is approximately 30% faster.
https://arxiv.org/abs/2311.12786
(i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a ‘wrapper’, is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient ‘revival’ of the capability
I think this supports my hypothesis that RLHF is a “reweighting” of existing algorithms rather than a writing of new algorithms into the network from scratch. (If somebody finds a similar paper on RLHF, that would be great.)
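As a hedged illustration of that “wrapper” framing (my own toy sketch, not the paper’s setup: the layer sizes, data, and the `wrapper` name are invented for the example), freeze a stand-in base network and let fine-tuning train only a small head on top. The behavior changes, but the base computation is untouched, which is roughly what “reweighting existing algorithms” looks like in its most extreme form.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "base model" whose weights are never touched during fine-tuning.
base = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))
for p in base.parameters():
    p.requires_grad = False

# The "wrapper": the only thing fine-tuning is allowed to learn.
wrapper = nn.Linear(64, 2)
opt = torch.optim.Adam(wrapper.parameters(), lr=1e-2)

# Hypothetical fine-tuning data: inputs and "good completion" labels.
x = torch.randn(256, 16)
y = torch.randint(0, 2, (256,))

for _ in range(200):
    loss = nn.functional.cross_entropy(wrapper(base(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Outputs now fit the fine-tuning labels, yet every weight in `base`
# is exactly what it was before fine-tuning.
```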
Sooooo, what does this say about RLHF incentivizing power-seeking?
It depends on:
Whether the base model has power-seeking algorithms
Whether power-seeking algorithms are likely to contribute to “correct” answers during RLHF fine-tuning
What excellent questions! I hope interpretability will help us answer them.
And, as Chris_Leong noted, it is unlikely that many details of current RLHF will still be around when superintelligences are trained.
I can agree with “RLHF doesn’t robustly disincentivize misaligned power-seeking that has occurred through other means” (I would expect it often does, but often doesn’t). Separately from all this, I’m not so worried about LLMs, because their method of gaining capabilities is based on imitation learning; but if you are more worried about imitation learning than I am, or if people start gaining more capabilities from “real agency”, then I’d say my post doesn’t disprove the possibility of misaligned power-seeking; it only argues that power-seeking is not what RLHF favors.
My point is that RLHF incentivizes all sorts of things, and these things depend on the content of the trained model, not on what RLHF is.
It depends on both.