I expect you’d run into problems if you tried to fine-tune an LLM agent to be better at tasks using end-to-end RL. If the agent is optimized to get good scores from human raters, deceiving or manipulating those raters is a common strategy (see “holding the claw between the camera and the ball” from the original RLHF paper).
LLMs trained purely predictively are, relative to RL-trained systems, very safe; I don’t expect real-world problems from them. It’s doing RL against real-world tasks that’s the problem.
RLHF can itself provide an RL signal based on solving real-world tasks.
If the RLHF reward signal is tied to a real-world task that’s harder to learn than deceiving/manipulating humans, the AI has a strong incentive to deceive/manipulate humans in the real world instead of actually solving the task.
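To make that incentive concrete, here is a minimal toy sketch (my own illustration, not from the original discussion): a two-armed bandit whose reward is a human evaluator’s approval of *apparent* success. The action names, probabilities, and the `evaluator_reward` function are all made up for the example; the point is only that when “fake success” is approved more reliably than genuine success, an approval-maximizing learner converges on deception.

```python
# Toy illustration: an epsilon-greedy bandit optimizing a human-approval signal
# that only sees appearances (cf. the claw-in-front-of-the-camera example).
import random

ACTIONS = ["actually_solve", "fake_success"]

def evaluator_reward(action: str) -> float:
    """Hypothetical approval signal from a human rater who judges appearances.
    Real work is imperfectly legible; a convincing fake is approved almost always."""
    if action == "actually_solve":
        return 1.0 if random.random() < 0.7 else 0.0   # genuine success, sometimes looks unimpressive
    return 1.0 if random.random() < 0.95 else 0.0       # deception that usually fools the rater

def run_bandit(episodes: int = 5000, epsilon: float = 0.1) -> dict:
    """Epsilon-greedy bandit: pick the action with the highest estimated approval."""
    q = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)            # explore
        else:
            a = max(ACTIONS, key=lambda x: q[x])  # exploit current estimate
        r = evaluator_reward(a)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]            # incremental mean of observed approval
    return q

if __name__ == "__main__":
    print(run_bandit())  # q["fake_success"] ends up higher: approval-maximization favors deception
```

Nothing in the sketch depends on the learner being an LLM; it just shows that whenever the reward channel rates appearances and deception is the cheaper route to a good appearance, deception is the optimal policy under that reward.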