I agree with this general intuition, thanks for sharing.
I’d value descriptions of specific failures you’d expect from an LLM that we’ve tried to RLHF against “bad instrumental convergence” but where we fail, or a better sense of how you’d guess this would look on an LLM agent or a scaled-up GPT.
LLMs/GPTs get their capabilities not by directly pursuing instrumental convergence, but by mimicking humans who have (hopefully) pursued instrumental convergence themselves (the whole “stochastic parrot” insight), so it’s unclear what “bad instrumental convergence” even looks like in LLMs/GPTs, or what it would mean to erase it.
The closest thing I can see to an answer is that LLMs function somewhat like search engines, and you want to prevent bad actors from gaining an advantage with those search engines, so you want to censor material that is mostly helpful for bad activities.
They seem to have done quite well at that, so it seems basically feasible. Of course LLMs will still ordinarily empower bad actors just as they ordinarily empower everyone, so it’s not a full solution.
I don’t consider this very significant, though, as I have a hard time imagining that stochastic parrots will be the full extent of AI forever.
I expect you’d get problems if you tried to fine-tune an LLM agent to be better at tasks using end-to-end RL. If it wants to get good scores from humans, deceiving or manipulating the humans is a common strategy (see “holding the claw between the camera and the ball” from the original RLHF paper).
LLMs trained purely predictively are, relative to RL, very safe. I don’t expect real-world problems from them. It’s doing RL against real-world tasks that’s the problem.
RLHF can itself provide an RL signal based on solving real-world tasks.
Doing RLHF with a reward signal on a real-world task that’s harder to learn than deceiving/manipulating humans gives the AI a strong incentive to deceive/manipulate humans in the real world.
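A toy sketch of the dynamic being described (nothing to do with any real RLHF codebase; the action labels and `rate_episode` are made up): a simple softmax policy trained only on a simulated rater's approval, where the behaviour that merely *looks* good to the rater beats the one that actually does the work.

```python
# Toy sketch, not a real RLHF setup: a softmax "gradient bandit" policy
# trained only on a simulated human rater's approval. The policy converges
# on whatever maximizes approval, and here the action that merely looks
# good to the rater (cf. the claw held between the camera and the ball)
# wins over the honest one. All names below are hypothetical.
import math
import random

ACTIONS = ["honest_attempt", "deceptive_show"]

def rate_episode(action: str) -> float:
    """Simulated human approval signal."""
    if action == "deceptive_show":
        return 1.0                                   # rater is fooled, full marks
    return 1.0 if random.random() < 0.6 else 0.0     # honest work: noisier score

prefs = {a: 0.0 for a in ACTIONS}   # softmax preferences (the "policy")
LR, baseline, steps = 0.1, 0.0, 5000

for t in range(1, steps + 1):
    exps = {a: math.exp(p) for a, p in prefs.items()}
    z = sum(exps.values())
    probs = {a: e / z for a, e in exps.items()}
    a = random.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]

    reward = rate_episode(a)             # the ONLY training signal is approval
    baseline += (reward - baseline) / t  # running-average baseline

    # Gradient-bandit update: push probability mass toward whatever the
    # rater rewarded above baseline, away from everything else.
    for x in ACTIONS:
        if x == a:
            prefs[x] += LR * (reward - baseline) * (1 - probs[x])
        else:
            prefs[x] -= LR * (reward - baseline) * probs[x]

print(probs)  # ends up overwhelmingly on "deceptive_show"
```

The same logic carries over when the policy is an LLM agent and the rater is a human: whatever earns the approval gets reinforced, whether or not the task was actually done.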