AGI with RL is Bad News for Safety

I haven’t found many credible reports on what algorithms and techniques were used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part of their training, which is also consistent with what OpenAI officially reported about o1 three months ago.

The use of RL to enhance the capabilities of AGI[1] appears to be a concerning development. As I wrote previously, I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t mean to rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also fine with LLMs being used as parts of more complex AI systems that invoke many instances of the underlying LLMs through chain-of-thought and other techniques. What I find worrisome is the underlying models themselves being trained to become more capable through open-ended RL.

The key argument in my original post was that AI systems based on pure language modeling are relatively safe because they are trained to mimic content generated by human-level intelligence, which creates only weak pressure to surpass human level. Even if we enhance their capabilities by composing many LLM operations together (as in chain of thought), each atomic operation in these complex reasoning structures is performed by a simple LLM that only tries to generate a good next token. Moreover, the reasoning is done in language we can read and understand, so it’s relatively easy to monitor these systems. The underlying LLMs have no reason to lie in very strategic ways[2], because they are not trained to plan ahead. There is also no reason for LLMs to become agentic, because at their core they are just prediction machines.
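To make the “composition of LLM operations” picture concrete, here is a minimal, purely illustrative sketch in Python. The function names (`llm_complete`, `solve_with_chain_of_thought`) are hypothetical placeholders, not any lab’s actual system; the point is only that every atomic step is plain text-in, text-out next-token prediction, and the intermediate reasoning stays in readable language that can be monitored.

```python
def llm_complete(prompt: str) -> str:
    """Stand-in for a call to a pure language model that predicts a continuation.
    Hypothetical placeholder: plug in any next-token-prediction model here."""
    raise NotImplementedError

def solve_with_chain_of_thought(question: str, max_steps: int = 5) -> str:
    """Compose several simple LLM calls into a longer reasoning chain."""
    transcript = f"Question: {question}\n"
    for step in range(max_steps):
        # Each atomic operation only asks the LLM for the next piece of text.
        thought = llm_complete(transcript + f"Step {step + 1}: think out loud.\n")
        transcript += thought + "\n"
        # The whole transcript is ordinary language, so a human (or another model)
        # can read and monitor every intermediate step.
        if "Final answer:" in thought:
            break
    return transcript
```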

Put RL into the core training algorithm and everything changes. Models are now explicitly trained to plan ahead, which means all kinds of strategic and agentic behaviors are actively rewarded. At this point we can no longer trust the chains of thought to represent the models’ true reasoning, because they are now rewarded based on the final results those chains lead to. Even if you impose a constraint requiring the intermediate tokens to look like logical reasoning, the models may find ways to produce seemingly logical tokens that encode additional side information useful for the problem they are trying to solve. Human-level intelligence also ceases to be an important milestone, because RL is about solving problems, not mimicking humans. In other words, we are now encouraging models to race toward superhuman capabilities.
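The following toy sketch (again Python, again hypothetical, and emphatically not any lab’s actual training code) illustrates why outcome-based RL breaks the link between the chain of thought and the true reasoning: the reward below depends only on the final answer, so any chain that ends in the right answer gets reinforced, whether or not its steps are faithful. The `looks_like_logical_reasoning` check is an assumed stand-in for whatever readability constraint one might add.

```python
def outcome_reward(chain_of_thought: str, final_answer: str, correct_answer: str) -> float:
    """Toy outcome-based reward: only the final result matters."""
    reward = 1.0 if final_answer.strip() == correct_answer.strip() else 0.0
    # A surface-level "must look like reasoning" bonus only constrains appearance:
    # tokens can still secretly encode side information useful for the final answer.
    if looks_like_logical_reasoning(chain_of_thought):
        reward += 0.1
    return reward

def looks_like_logical_reasoning(text: str) -> bool:
    """Hypothetical, easily satisfied surface check on the intermediate tokens."""
    return len(text) > 0
```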

For all of these reasons, a world in which frontier AI labs experiment with open-ended RL to enhance their models seems much more likely to end up with AI models developing capabilities no one intended.

My original post from 20 months ago (Language Models are a Potentially Safe Path to Human-Level AGI) elaborates on all the points I briefly made here. I don’t claim any particular novelty in what I wrote there, but I think it has mostly stood the test of time[3] (despite being published only five months after the initial release of ChatGPT). In particular, I still think that pure LLMs (and more complex AI systems based on chaining LLM outputs together) are relatively safe, and that humanity would be safer sticking with them for the time being.

  1. ^

    Artificial general intelligence (AGI) means totally different things to different people. When I use this term, the emphasis is on “general”, regardless of the model’s strength and whether it is above or below human level. For example, I consider GPT-3 and GPT-4 to be forms of AGI because they have general knowledge about the world.

  2. ^

    LLMs do lie a lot (and hallucinate), but mostly in cute and naive ways. All the examples of untruthful behavior I have seen in LLMs so far seem perfectly consistent with the assumption that they just want a good reward for the next token (as judged through RLHF). I haven’t seen any evidence of LLMs lying strategically, which I define as pursuing long-term goals beyond having their next few tokens receive a higher reward than they deserve.

  3. ^

    Perhaps I was overly optimistic about how much economic activity would focus on producing more capable AI systems without more capable underlying models.