DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
TL;DR: GPTs are imitation learners, even with current forms of RLHF.
Direct preference optimization (DPO) is a conditioning method for generative probabilistic models where pairs of outputs are ranked (e.g. by human raters) based on which one is better, and then (roughly speaking) you apply gradient updates to increase the probability of the “good” outputs relative to the “bad” outputs.
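To make that mechanism concrete, here is a minimal sketch of a DPO-style loss, assuming the summed log-probability of each completion is already available under both the policy being trained and a frozen reference model; the function name, variable names, and numbers below are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Sketch of a DPO-style preference loss.

    Each argument is the summed log-probability of a whole completion:
    logp_* under the policy being trained, ref_logp_* under the frozen
    reference model. Minimizing this pushes the policy to raise the
    probability of the "good" completion relative to the "bad" one,
    measured against the reference model.
    """
    policy_margin = logp_good - logp_bad        # how strongly the policy prefers "good"
    ref_margin = ref_logp_good - ref_logp_bad   # how strongly the reference already did
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    logp_good=torch.tensor([-12.0, -30.0]),
    logp_bad=torch.tensor([-15.0, -28.0]),
    ref_logp_good=torch.tensor([-13.0, -29.0]),
    ref_logp_bad=torch.tensor([-14.0, -29.5]),
)
print(float(loss))  # smaller when the policy already prefers the "good" outputs
```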
This is bad for notkilleveryoneism because it incentivizes the models to generate deceptive outputs that look “better” (according to human judgement) than they really are. However, I think a lot of rationalists[1] overestimate how bad it is for alignment, because they think it also incentivizes misaligned powerseeking when really it doesn’t.
Humans give LLMs the opportunity to execute power-seeking actions by following instruction texts that they generate. However, we’re not gonna follow complex instructions we don’t understand and rank them based on the black-box results. Rather, to rank the outputs, we will use our own judgement to evaluate the texts (e.g. reasoning about the consequences of following instructions), and rank them based on this.
If the LLMs accidentally generate outputs that confuse our judgement—e.g. telling us advice that seems like it would earn us money, but actually doesn’t—then such outputs can be reinforced, leading to deceptive LLMs. However, this deception doesn’t actually have to continue deceiving us and strengthening itself once put into practice; it only has to deceive us for long enough to be favored by DPO.
In order for complex capabilities to be developed through DPO-like methods, humans have to recognize what method the AI is using, and whether it is making incremental progress, because without this sort of reward-shaping, it is exponentially unlikely for an AI to stumble into complex solutions to tasks by sheer chance.
Misaligned powerseeking obscured by deceptive alignment—where an AI develops a preference for rewards, but hides that preference in order to get away with seeking the rewards—cannot develop in this way, because when humans recognize these complex powerseeking maneuvers, we don’t reinforce them.
In mathematical terms, I would argue we can view the capabilities gained from DPO-like methods as being something along the following lines:

$$\Pi^* = \underset{\Pi}{\arg\max}\; \mathbb{E}_{(r,\, q_r) \sim Q}\Big[\, \mathbb{E}_{o \sim \Pi(q_r)}\big[\, \mathbb{E}_{x \sim D_r(o)}[\, U_r(x) \,]\,\big]\,\Big]$$

Here, r is meant to represent a human rater, o is meant to represent an output of the network, D_r is the distribution of outcomes as understood by the human rater, U_r is the preference ordering of the human rater, Π is the policy (neural network weights) under consideration, q_r is the query that rater r has for the model, and Q is the distribution of rater-queries (e.g. ChatGPT users who provide thumbs-up/thumbs-down).
This could probably be factored in other ways, but there are two important points to consider, which I think will be preserved across factorizations:
The intelligence of the AI (e.g. ChatGPT) is in Π, but the distribution used to infer the effects on outcomes is in D_r, which tends to be a completely stupid empirical distribution.
The expression is myopic: it treats each rater as independent, rather than seeking utility by influencing one rater in order to cause another rater to upvote answers.
The evaluation of the consequences for the purpose of utility factors entirely through the rater. While Π may have an internal search procedure with its own world-model and consequences, this search procedure has no effect on E[U] except through the judgement of the rater. Search procedures which deviate from this judgement, including wireheading ones that work by confusing the rater in the present for the purpose of deceiving the rater in the future, will have lower expected utility than search procedures that align with this judgement.
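To make this factorization concrete, here is a minimal, purely illustrative Monte Carlo sketch of the objective above. The Rater class and its sample_query, outcome_model, and utility methods are invented stand-ins for r, q_r, D_r, and U_r, not part of any real training stack; the point is that the policy’s outputs only enter the score through the rater’s own outcome model and judgement.

```python
import random

class Rater:
    """Hypothetical stand-in for a human rater r, with their own crude
    outcome model D_r and preference ordering U_r (names invented here)."""

    def sample_query(self):
        # q_r: a query this rater might send to the model.
        return random.choice(["make me money", "summarize this paper"])

    def outcome_model(self, output):
        # D_r: the rater's own guess at what happens if they act on `output`.
        # Deliberately crude: a noisy function of how confident and exciting
        # the text looks, not of its real-world consequences.
        return output.count("!") + random.gauss(0, 1)

    def utility(self, outcome):
        # U_r: the rater prefers outcomes that look better to them.
        return outcome


def expected_utility(policy, raters, n_queries=1000, n_samples=10):
    """Monte Carlo estimate of
    E_{(r, q_r) ~ Q}[ E_{o ~ Pi(q_r)}[ E_{x ~ D_r(o)}[ U_r(x) ] ] ].
    The policy's outputs only enter through the rater's model and judgement."""
    total = 0.0
    for _ in range(n_queries):
        rater = random.choice(raters)          # myopic: one rater at a time
        output = policy(rater.sample_query())  # the AI's intelligence lives in `policy`
        total += sum(rater.utility(rater.outcome_model(output))
                     for _ in range(n_samples)) / n_samples
    return total / n_queries


# Toy usage: a "policy" that has learned that exclamation marks impress this rater.
print(expected_utility(lambda q: "Sure! Here is a plan!!", [Rater()]))
```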
[1] The proximate cause of this post was that The Standard Analogy, which makes this error, was presented at less.online, and as I talked to several people at the festival, I exclusively found people who made the same mistake. However, the mistake has been made lots of times elsewhere, seemingly to the point of e.g. alienating Alex Turner because of the rationalist community’s insistence on it.
I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’”. Admittedly it did take many patient attempts for Holden to get that point through my thick skull ¯\_(ツ)_/¯ So it’s nice that you wrote this.
I do however think there are reasons for concern if you keep the setup (“incentives”) the same but allow arbitrarily complicated Π, including much more complicated than an LLM (e.g., instead of “the AI does one forward pass through GPT-4 and then prints a character”, replace it with “the AI does arbitrary opaque brainstorming for six hours with a giant hidden scratchpad and free access to a cached copy of the internet, and then it prints a message”)—see §5.2 of that same post for that argument.
For example, it’s conceivable that an AI can pull off a treacherous turn on its first try, and if you’re trying to reason about whether that’s gonna happen or not, you can’t straightforwardly do so via your Π* expression—(1) the treacherous-turn output is very unlike any of the AI’s previous outputs (by assumption), and (2) as soon as the AI emits such an output, it’s game over. Again see that post for my (somewhat casual and intuitive) attempt to reason about that possibility.
I agree.
I think my delta relative to this view is that I think agency is sufficiently complex and non-unique that there’s an endless variety of pseudo-agencies that can just as easily be developed as full agency, as long as they receive the appropriate reinforcement. So reasoning of the form “X selection criterion benefits from full agency in pursuit of Y, so therefore full agency in pursuit of Y will develop” is invalid; instead what will happen is “full agency in pursuit of Y is a worse solution to X than Z is, so selection for X will select for Z”, mainly due to there being a lot of Zs.
Basically, I postulate the whole “raters make systematic errors—regular, compactly describable, predictable errors” aspect means that you get lots of evidence to support some other notion of agency.
I think it’s most likely if you have some AI trained by some non-imitation-learning self-supervised method (e.g. self-play), and then you fine-tune it with RLHF. Here it would be the self-supervised learning that functions to incentivize the misaligned powerseeking, and RLHF merely failing to avoid it.
It depends on the type of self-play. If the self-play is entirely between AIs, with no other human-like parties in the environment, I agree, because these self-play AIs could learn to cooperate/scheme very powerfully. But if the environment contains (simulations of) human(-like) agents, including intentionally weak ones, and the evaluation includes scoring factors like care for and collaboration with them, then it might look different.
I think you’d actually need some presence of some human-like entities in order for the AI to learn to deceive humans specifically.
Yes, but it makes a difference whether your environment is composed entirely of singular agents of the same kind (the self-playing AI) or whether it has a variety of simulated agents acting in complex social structures, where the behavior of the self-play AI within the social structure is scored.
I think that “behaviorist” interpretation of RL (that you “reinforce” behavior) is wrong in general and especially wrong in the case of RLHFing LLMs. Instead of thinking about “reinforcing behavior” you should think about “reinforcing algorithms that contribute to behavior”. The consequence of this is the following:
You have a base model which is trained on a bazillion texts, which include, say, deceptive behavior and, correspondingly, algorithms for deceptive behavior
You fine-tune the model on “good” completions
But “good” completions can be produced by both “good” algorithms and “bad-but-pretending-to-be-good” algorithms, so both types of algorithms get reinforced
Importantly, this doesn’t depend on whether the evaluator did a good job. A perfect deceiver, by definition, produces the same answers as a good honest agent (before deployment), so in the end the odds ratio between the good honest agent and the perfect deceiver stays the same (modulo quirks in LLM cognition), while everything else is negatively reinforced.
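To put the odds-ratio claim in miniature: here is a toy sketch with made-up numbers, treating the base model as a mixture over internal “algorithms” and treating fine-tuning on “good” completions, very roughly, as Bayesian conditioning on having produced one.

```python
# Toy reweighting sketch of the claim above (all numbers are made up).
prior = {"honest": 0.10, "perfect_deceiver": 0.02, "obviously_bad": 0.88}
p_good_output = {"honest": 1.0, "perfect_deceiver": 1.0, "obviously_bad": 0.2}

unnormalized = {k: prior[k] * p_good_output[k] for k in prior}
z = sum(unnormalized.values())
posterior = {k: v / z for k, v in unnormalized.items()}

print(posterior)  # honest and perfect_deceiver both grow; obviously_bad shrinks
print(prior["honest"] / prior["perfect_deceiver"],         # 5.0 before fine-tuning
      posterior["honest"] / posterior["perfect_deceiver"])  # 5.0 after: unchanged,
# because a perfect deceiver produces the same pre-deployment answers as an
# honest agent, so the evaluator cannot tell them apart.
```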
Evidence:
https://arxiv.org/abs/2311.07590
If you look at Figure 3, you will find that the RLHF’d model is the most likely to deceive. I think this is not because somebody rewarded it for deception in similar conditions, but because the very process of RLHF makes deceptive algorithms the second-most-reinforced in the LLM.
https://www.anthropic.com/research/many-shot-jailbreaking
Look at the “Malicious use cases” graph: deception is the fastest to be elicited. Also note that the x-axis is on a log scale, so, generally, deception-jailbreaking is approximately 30% faster.
https://arxiv.org/abs/2311.12786
I think this supports my hypothesis that RLHF is a “reweighting” of existing algorithms rather than writing algorithms into the network from scratch. (If somebody finds a similar paper on RLHF, that would be great.)
Sooooo what does it say about RLHF incentivizing power-seeking?
It depends on:
Whether the base model has power-seeking algorithms
Whether it is likely for power-seeking algorithms to contribute to “correct” answers in RLHF-finetuning
What excellent questions, I hope interpretability will help us answer them.
And, as Chris_Leong noted, it is unlikely that many details of current RLHF will still be here during training of superintelligences.
I can agree with “RLHF doesn’t robustly disincentivize misaligned powerseeking that has occurred through other means” (I would expect it often does but often doesn’t). Separately from all this, I’m not so worried about LLMs because their method of gaining capabilities is based on imitation learning. But if you are more worried about imitation learning than I am, or people start gaining more capabilities from “real agency”, then I’d say my post doesn’t disprove the possibility of misaligned powerseeking; it only argues that that’s not what RLHF favors.
My point is that RLHF incentivizes all sorts of things, and these things depend on the content of the trained model, not on what RLHF is.
It depends on both.
As far as I understand, in RLHF, PPO/DPO doesn’t directly use preferences from human raters, but instead synthetic preference data generated by a reward model. The reward model in turn is trained on preference data given by actual human raters. The reward model may be misgeneralizing this data, in which case the DPO input may include preferences that humans wouldn’t give, which might change your conclusion.
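For reference, here is a minimal sketch of the reward-model step described above, assuming a scalar score per completion; the Bradley-Terry-style pairwise loss below is the standard way such a model is fit to human preference pairs, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred, score_rejected):
    """Bradley-Terry-style pairwise loss for fitting a reward model to human
    preference data: push the scalar score of the human-preferred completion
    above the score of the rejected one. The fitted reward model is then used
    to score fresh policy samples (for PPO, or to generate synthetic
    preference pairs), which is where misgeneralization relative to the
    original human raters can creep in."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with made-up scalar scores for a batch of three preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3, -0.5]),
                         torch.tensor([0.4, 0.9, -1.0]))
print(float(loss))
```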
I’d say it adds an extra step of indirection where the causal structure of reality gets “blurred out” by an agent’s judgement, and so a reward model strengthens rather than weakens this dynamic?
Seems like at some point we’ll need to train on outputs too complex for humans to evaluate, then we’ll end up using training methods based on outcomes in some simulation.
I agree. Personally my main takeaway is that it’s unwise to extrapolate alignment dynamics from the empirical results of current methods. But this is a somewhat different line of argument which I made in Where do you get your capabilities from?.