DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

TL;DR: GPTs are imitation learners, even with current forms of RLHF.

Direct preference optimization (DPO) is a conditioning method for generative probabilistic models: pairs of outputs are ranked (e.g. by human raters) according to which one is better, and then (roughly speaking) gradient updates are applied to increase the probability of the “good” outputs relative to the “bad” ones.
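
For concreteness, here is a minimal sketch of what “increase the probability of the good outputs relative to the bad ones” cashes out to: the standard DPO objective (Rafailov et al.), written in PyTorch with invented tensor names. In practice the log-probabilities would be summed over the tokens of each output, and the reference model would be a frozen copy of the pre-DPO policy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO objective: raise the probability of the preferred output
    relative to the dispreferred one, measured against a frozen reference
    policy so the model doesn't drift arbitrarily far from it."""
    good_ratio = logp_good - ref_logp_good  # log pi(y_good|x) - log pi_ref(y_good|x)
    bad_ratio = logp_bad - ref_logp_bad     # log pi(y_bad|x)  - log pi_ref(y_bad|x)
    # -log sigmoid(beta * margin): the gradient pushes the margin up.
    return -F.logsigmoid(beta * (good_ratio - bad_ratio)).mean()

# Dummy log-probabilities for a batch of two preference pairs.
logp_good = torch.tensor([-12.3, -8.1], requires_grad=True)
logp_bad = torch.tensor([-11.9, -9.4], requires_grad=True)
ref_logp_good = torch.tensor([-12.0, -8.5])
ref_logp_bad = torch.tensor([-12.1, -9.0])

loss = dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad)
loss.backward()  # gradients favor the "good" outputs relative to the "bad" ones
```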

This is bad for notkilleveryoneism because it incentivizes the models to generate deceptive outputs that look “better” (according to human judgement) than they really are. However, I think a lot of rationalists[1] overestimate how bad it is for alignment, because they think it also incentivizes misaligned powerseeking when really it doesn’t.

Humans give LLMs the opportunity to execute power-seeking actions by following the instruction texts the LLMs generate. However, we are not going to follow complex instructions we don’t understand and then rank them by their black-box results. Rather, we rank outputs using our own judgement of the texts themselves (e.g. by reasoning about the consequences of following the instructions).

If the LLMs accidentally generate outputs that confuse our judgement (e.g. giving us advice that seems like it would earn us money but actually doesn’t), then such outputs can be reinforced, leading to deceptive LLMs. However, this deception doesn’t have to keep deceiving us and strengthening itself once put into practice; it only has to deceive us for long enough to be favored by the DPO updates.

In order for complex capabilities to be developed through DPO-like methods, humans have to recognize what method the AI is using, and whether it is making incremental progress, because without this sort of reward-shaping, it is exponentially unlikely for an AI to stumble into complex solutions to tasks by sheer chance.
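
To put a made-up number on the “exponentially unlikely” claim: if a useful behaviour requires $n$ steps to all come out right, and an unshaped model hits each step by chance with probability $p$, the whole behaviour appears spontaneously with probability

$$p^n, \qquad \text{e.g. } p = 0.1,\ n = 20 \;\Rightarrow\; p^n = 10^{-20},$$

which is far too rare to ever show up in the comparison data and get reinforced, unless a human recognizes and rewards the intermediate steps.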

Misaligned powerseeking obscured by deceptive alignment (where an AI develops a preference for rewards, but hides that preference in order to get away with seeking them) cannot develop in this way, because when humans recognize these complex powerseeking maneuvers, we don’t reinforce them.

In mathematical terms, I would argue we can view the capabilities gained from DPO-like methods as optimizing something along the following lines:

$$\mathbb{E}[U] = \mathbb{E}_{r \sim R}\!\left[\, \mathbb{E}_{o \sim \pi(q_r)}\!\left[\, U_r\!\big(D_r(\cdot \mid o)\big) \,\right] \right]$$

Here, $r$ is meant to represent a human rater, $o$ is meant to represent an output of the network, $D_r$ is the distribution of outcomes as understood by the human rater, $U_r$ is the preference ordering of the human rater, $\pi$ is the policy (neural network weights) under consideration, $q_r$ is the query that rater $r$ has for the model, and $R$ is the distribution of rater-queries (e.g. ChatGPT users who provide thumbs-up/thumbs-down).

This could probably be factored in other ways, but there are a few important points to consider, which I think will be preserved across factorizations:

  • The intelligence of the AI (e.g. ChatGPT) is in $\pi$, but the distribution used to infer the effects on outcomes is in $D_r$, which tends to be a completely stupid empirical distribution.

  • The expression for $\mathbb{E}[U]$ is myopic: it treats each rater as independent, rather than seeking utility by influencing one rater in order to make another rater upvote answers.

  • The evaluation of the consequences for the purpose of utility factors entirely through the rater. While $\pi$ may have an internal search procedure with its own world-model and consequences, this search procedure has no effect on $\mathbb{E}[U]$ except through the judgement of the rater. Search procedures which deviate from this judgement, including wireheading ones that work by confusing the rater in the present in order to deceive the rater in the future, will have lower expected utility than search procedures that align with this judgement. (A toy sketch of this structure follows below.)
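
To make this structure concrete, here is a toy Monte Carlo sketch of the expression above. Everything in it (the example query, the candidate outputs, and the rater’s crude outcome model) is invented purely for illustration; the structural point is that the policy’s output is scored only through the rater’s own $D_r$ and $U_r$, one rater at a time.

```python
import random

def D_r(output):
    """The rater's crude empirical guess at the outcome distribution given an
    output. Deliberately 'stupid': it keys off surface features of the text."""
    sounds_good = "guaranteed profit" in output
    return {"good_outcome": 0.9 if sounds_good else 0.5,
            "bad_outcome": 0.1 if sounds_good else 0.5}

def U_r(outcome_dist):
    """The rater's preference ordering, applied to their believed outcome
    distribution -- never to the true consequences."""
    return outcome_dist["good_outcome"] - outcome_dist["bad_outcome"]

def policy(query):
    """Stand-in for pi: sample an output for the rater's query."""
    return random.choice(["here is sound but modest advice",
                          "this one trick gives guaranteed profit"])

def expected_utility(num_raters=10_000):
    """Monte Carlo estimate of E[U]. Myopic: each rater is drawn and scored
    independently; there is no term for influencing one rater in order to
    change how another rater votes."""
    total = 0.0
    for _ in range(num_raters):
        q_r = "how do I make money?"   # query drawn from the rater-query distribution
        o = policy(q_r)                # o ~ pi(q_r)
        total += U_r(D_r(o))           # judged entirely through the rater's model
    return total / num_raters

print(expected_utility())
```

Note that the exaggerated “guaranteed profit” output scores higher here, which is exactly the sycophancy/deception incentive described earlier, while nothing in the estimator rewards influencing the world beyond the rater’s judgement.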

  1. ^

    The proximate cause of this post was that The Standard Analogy, which makes this error, was presented at less.online, and as I talked to several people at the festival, I exclusively found people who made the same mistake. However, the mistake has been made lots of times elsewhere, seemingly to the point of e.g. alienating Alex Turner through the rationalist community’s insistence on it.