I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope language models will be friendly because they’re not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)
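To make the “jam it into RL” point concrete, here’s a minimal sketch under toy assumptions of my own (the copy task, the reward function, and all names are made up, not anyone’s actual pipeline): a model trained purely by supervised learning becomes an RL policy the moment someone points a scalar reward at it and runs a policy-gradient update on top, which is essentially the move RLHF makes with language models.

```python
# Hypothetical toy, not any real RLHF pipeline: stage 1 trains a model with pure
# supervised learning; stage 2 "jams it into RL" by optimizing the same network
# with REINFORCE against an external reward standing in for a preference model.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N_TOKENS = 8  # toy "vocabulary"

policy = nn.Sequential(nn.Linear(N_TOKENS, 32), nn.ReLU(), nn.Linear(32, N_TOKENS))

# Stage 1: supervised learning -- imitate fixed targets (no reward, no agency).
targets = torch.randint(0, N_TOKENS, (256,))
contexts = F.one_hot(targets, N_TOKENS).float()  # trivial copy task
sl_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.cross_entropy(policy(contexts), targets)
    sl_opt.zero_grad(); loss.backward(); sl_opt.step()

# Stage 2: RL fine-tuning -- the same network, now optimized to *get reward*.
def reward_fn(actions):           # stand-in for a learned preference reward model
    return (actions == 3).float()

rl_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for _ in range(200):
    dist = torch.distributions.Categorical(logits=policy(contexts))
    actions = dist.sample()
    # REINFORCE: push up the log-probability of whatever the reward favors.
    loss = -(dist.log_prob(actions) * reward_fn(actions)).mean()
    rl_opt.zero_grad(); loss.backward(); rl_opt.step()
```

The point of the toy is just that nothing about the SL stage protects the final system: once the RL loop is bolted on, the optimization target is reward, not imitation.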
The central problem with RL isn’t that it is vulnerable to wireheading (the “obvious problem”), or that it’s going to build a very detailed model of the world. Wireheading on its own (with, e.g., a myopic or procrastinator AI) could just look like the AI leaving us alone, so long as we guarantee that its reward numbers will be really, really high.
No, the problem is long-term planning and agentic-ness, which together imply that the AI will realize that seizing power is a good instrumental goal.
> Model-based RL with a fixed, human-legible model wouldn’t learn to manipulate the reward-evaluation process
No: instead it manipulates the world model, which is by assumption imperfect (no fixed, human-legible model can capture everything that matters), and so no useful systems can be constructed this way. This has been a capabilities problem for model-based RL for decades, even with learned models, and it is still not fully solved.
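A minimal sketch of that failure mode, under toy assumptions of my own (a one-step decision problem with Gaussian model error, nothing from the quoted post): a planner that maximizes value under a fixed, imperfect model systematically picks the actions where the model’s error happens to be most flattering, so the plan looks great inside the model and disappoints in reality.

```python
# Toy illustration (assumptions mine): the optimizer's curse for a planner that
# maximizes value under an imperfect world model rather than the true environment.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10_000

true_value = rng.normal(0.0, 1.0, n_actions)    # actual value of each action
model_error = rng.normal(0.0, 1.0, n_actions)   # imperfection of the fixed model
model_value = true_value + model_error          # what the planner's model predicts

chosen = np.argmax(model_value)                 # the planner trusts its model
print(f"model's estimate of the chosen action: {model_value[chosen]:+.2f}")
print(f"true value of the chosen action:       {true_value[chosen]:+.2f}")
print(f"model error on the chosen action:      {model_error[chosen]:+.2f}")
# The chosen action almost always has a large positive model_error: optimization
# pressure concentrates exactly where the model is wrong in the planner's favor.
```

That is the capabilities problem referred to above: the harder the planner optimizes, the harder it leans on the model’s mistakes.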