I expect you’d run into problems if you tried to fine-tune an LLM agent to be better at tasks using end-to-end RL. If the agent is optimized to get good scores from human raters, deceiving or manipulating those raters is a common strategy (see “holding the claw between the camera and the ball” from the original RLHF paper).
LLMs trained purely predictively are, relative to RL-trained systems, very safe; I don’t expect real-world problems from them. It’s doing RL against real-world tasks that’s the problem.
RLHF can itself provide an RL signal based on solving real-world tasks.
If the RLHF reward signal is tied to a real-world task that’s harder to learn than deceiving/manipulating humans, the AI has a strong incentive to deceive/manipulate humans in the real world instead of actually solving the task.
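To make that incentive concrete, here is a minimal toy sketch (my own illustration, not from the original discussion): a two-armed bandit whose reward is a human evaluator’s approval of *apparent* success. The action names, probabilities, and the `evaluator_reward` function are all made up for the example; the point is only that when “fake success” is approved more reliably than genuine success, an approval-maximizing learner converges on deception.

```python
# Toy illustration: an epsilon-greedy bandit optimizing a human-approval signal
# that only sees appearances (cf. the claw-in-front-of-the-camera example).
import random

ACTIONS = ["actually_solve", "fake_success"]

def evaluator_reward(action: str) -> float:
    """Hypothetical approval signal from a human rater who judges appearances.
    Real work is imperfectly legible; a convincing fake is approved almost always."""
    if action == "actually_solve":
        return 1.0 if random.random() < 0.7 else 0.0   # genuine success, sometimes looks unimpressive
    return 1.0 if random.random() < 0.95 else 0.0       # deception that usually fools the rater

def run_bandit(episodes: int = 5000, epsilon: float = 0.1) -> dict:
    """Epsilon-greedy bandit: pick the action with the highest estimated approval."""
    q = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)            # explore
        else:
            a = max(ACTIONS, key=lambda x: q[x])  # exploit current estimate
        r = evaluator_reward(a)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]            # incremental mean of observed approval
    return q

if __name__ == "__main__":
    print(run_bandit())  # q["fake_success"] ends up higher: approval-maximization favors deception
```

Nothing in the sketch depends on the learner being an LLM; it just shows that whenever the reward channel rates appearances and deception is the cheaper route to a good appearance, deception is the optimal policy under that reward.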