In traditional reinforcement learning, the environment would also supply a reward [...] and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. [...] Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human. [...] After using r to compute rewards, we are left with a traditional reinforcement learning problem
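To make the setup concrete, here is a minimal sketch (PyTorch; the class and function names are my own, not the paper's) of the two pieces the excerpt describes: a reward estimate defined over trajectory segments, fit to pairwise human preferences with a Bradley-Terry-style cross-entropy loss, whose predictions can then be handed to any ordinary RL algorithm. It is meant to illustrate the general recipe, not the paper's exact architecture or training details.

```python
import torch
import torch.nn as nn


class SegmentRewardModel(nn.Module):
    """Predicts a per-step reward; a segment's score is the sum over its steps."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        per_step = self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # (batch, T)
        return per_step.sum(dim=-1)  # (batch,)


def preference_loss(model, seg1, seg2, prefs):
    """Cross-entropy loss on human comparisons between pairs of segments.

    seg1, seg2: (obs, act) tensor pairs for the two segments shown to the human.
    prefs[i] = 1.0 if the human preferred segment 1, 0.0 if segment 2,
    and 0.5 if the comparison was marked as a tie.
    """
    r1 = model.segment_return(*seg1)
    r2 = model.segment_return(*seg2)
    # Probability that segment 1 is preferred follows a Bradley-Terry model
    # on the predicted segment returns.
    logits = r1 - r2
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)
```

Once a model like this is fit on comparison data, each environment step can be labeled with its predicted reward and the result treated as a standard reinforcement learning problem, as the excerpt notes.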
The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. [...] Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
Evaluating consequences is hard.
A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
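For contrast with the critique above, here is an equally minimal sketch of what "training the models to produce highly-evaluated actions" can look like in the simplest discrete-action case: a REINFORCE-style update that upweights actions in proportion to their scalar evaluations, whether those come from human raters or from a learned reward model like the one sketched earlier. The function and argument names are hypothetical, and the listed difficulties apply however this step is implemented.

```python
import torch
import torch.nn as nn


def evaluation_weighted_update(policy: nn.Module, optimizer, obs, actions, scores):
    """One REINFORCE-style step: raise the log-probability of actions in
    proportion to how highly they were evaluated.

    obs:     (batch, obs_dim) states in which the actions were taken
    actions: (batch,)         indices of the sampled discrete actions
    scores:  (batch,)         scalar evaluations of those actions
    """
    logits = policy(obs)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Centering the scores reduces gradient variance without changing the optimum.
    advantages = scores - scores.mean()
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```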
[...] I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive.
For context, the excerpts above are from Paul Christiano's work on RLHF and a later reflection on it:
Christiano et al.: Deep Reinforcement Learning from Human Preferences
Christiano: Thoughts on the impact of RLHF research
Edit: Another relevant section in an interview of Paul Christiano by Dwarkesh Patel:
Paul Christiano—Preventing an AI Takeover