In traditional reinforcement learning, the environment would also supply a reward [...] and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. [...] Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human. [...] After using r to compute rewards, we are left with a traditional reinforcement learning problem
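To make the setup concrete, here is a minimal sketch (PyTorch; the class and function names are my own, not the paper's) of the two pieces the excerpt describes: a reward estimate defined over trajectory segments, fit to pairwise human preferences with a Bradley-Terry-style cross-entropy loss, whose predictions can then be handed to any ordinary RL algorithm. It is meant to illustrate the general recipe, not the paper's exact architecture or training details.

```python
import torch
import torch.nn as nn


class SegmentRewardModel(nn.Module):
    """Predicts a per-step reward; a segment's score is the sum over its steps."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        per_step = self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # (batch, T)
        return per_step.sum(dim=-1)  # (batch,)


def preference_loss(model, seg1, seg2, prefs):
    """Cross-entropy loss on human comparisons between pairs of segments.

    seg1, seg2: (obs, act) tensor pairs for the two segments shown to the human.
    prefs[i] = 1.0 if the human preferred segment 1, 0.0 if segment 2,
    and 0.5 if the comparison was marked as a tie.
    """
    r1 = model.segment_return(*seg1)
    r2 = model.segment_return(*seg2)
    # Probability that segment 1 is preferred follows a Bradley-Terry model
    # on the predicted segment returns.
    logits = r1 - r2
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)
```

Once a model like this is fit on comparison data, each environment step can be labeled with its predicted reward and the result treated as a standard reinforcement learning problem, as the excerpt notes.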
The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. [...] Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
Evaluating consequences is hard.
A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
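For contrast with the critique above, here is an equally minimal sketch of what "training the models to produce highly-evaluated actions" can look like in the simplest discrete-action case: a REINFORCE-style update that upweights actions in proportion to their scalar evaluations, whether those come from human raters or from a learned reward model like the one sketched earlier. The function and argument names are hypothetical, and the listed difficulties apply however this step is implemented.

```python
import torch
import torch.nn as nn


def evaluation_weighted_update(policy: nn.Module, optimizer, obs, actions, scores):
    """One REINFORCE-style step: raise the log-probability of actions in
    proportion to how highly they were evaluated.

    obs:     (batch, obs_dim) states in which the actions were taken
    actions: (batch,)         indices of the sampled discrete actions
    scores:  (batch,)         scalar evaluations of those actions
    """
    logits = policy(obs)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Centering the scores reduces gradient variance without changing the optimum.
    advantages = scores - scores.mean()
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```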
[...] I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive.
For context, the excerpts above are from Paul Christiano's work on RLHF and a later reflection on it:
Christiano et al.: Deep Reinforcement Learning from Human Preferences
Christiano: Thoughts on the impact of RLHF research
Edit: Another relevant section in an interview of Paul Christiano by Dwarkesh Patel:
Paul Christiano—Preventing an AI Takeover