AGI with RL is Bad News for Safety
I haven’t found many credible reports on what algorithms and techniques have been used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part, which is also consistent with what OpenAI officially reported about o1 three months ago.
The use of RL to enhance the capabilities of AGI[1] appears to be a concerning development. As I wrote previously, I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also okay with the LLMs being used as part of more complex AI systems that invoke many instances of the underlying LLMs through chain-of-thought and other techniques. What I find worrisome is the underlying models themselves trained to become more capable through open-ended RL.
The key argument in my original post was that AI systems based on pure language modeling are relatively safe because they are trained to mimic content generated by human-level intelligence. This leads to weak pressure to surpass human level. Even if we enhance their capabilities by composing together many LLM operations (as in chain of thought), each atomic operation in these complex reasoning structures would be made by a simple LLM that only tries to generate a good next token. Moreover, the reasoning is done with language we can read and understand, so it’s relatively easy to monitor these systems. The underlying LLMs have no reason to lie in very strategic ways[2], because they are not trained to plan ahead. There is also no reason for LLMs to become agentic, because at their core they are just prediction machines.
Put RL into the core training algorithm and everything changes. Models are now explicitly trained to plan ahead, which makes all kinds of strategic and agentic behaviors actively rewarded. At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. In other words, we are now encouraging models to race towards superhuman capabilities.
For all of these reasons, in a world where frontier AI labs are experimenting with open-ended RL to enhance their models, it seems much more likely that we will end up with AI models developing capabilities no one intended.
My original post from 20 months ago (Language Models are a Potentially Safe Path to Human-Level AGI) elaborates on all the points I briefly made here. I don’t claim any particular novelty in what I wrote there, but I think it mostly stood the test of time[3] (despite being published only 5 months after the initial release of ChatGPT). In particular, I still think that pure LLMs (and more complex AI systems based on chaining LLM outputs together) are relatively safe, and that humanity would be safer sticking with them for the time being.
- ^
Artificial general intelligence (AGI) means totally different things to different people. When I use this term, the emphasis is on “general”, regardless of the model’s strength and whether it exceeds or falls below human level. For example, I consider GPT-3 and GPT-4 forms of AGI because they have general knowledge about the world.
- ^
LLMs do lie a lot (and hallucinate), but mostly in cute and naive ways. All the examples of untruthful behavior I have seen in LLMs so far seem perfectly consistent with the assumption that they just want a good reward for the next token (judged through RLHF). I haven’t seen any evidence of LLMs lying in strategic ways, which I define as trying to pursue long-term goals beyond having their next few tokens receive a higher reward than they deserve.
- ^
Perhaps I was overly optimistic about the amount of economic activity focused on producing more capable AI systems without more capable underlying models.
I think you’re absolutely right.
And I think there’s approximately a 0% chance humanity will stop at pure language models, or even stop at o1 and o3, which very likely use RL to dramatically enhance capabilities.
Because they use RL not to accomplish things-in-the-world but to arrive at correct answers to the questions they’re posed, the concerns you express (which pretty much anyone who’s been paying attention to AGI risk shares) are not fully in play.
OpenAI will continue on this path unless legislation stops them. And that’s highly unlikely to happen, because the argument against it is just not strong enough to convince the public or legislators.
We are mostly applying optimization pressure to our AGI systems to follow instructions and produce correct answers. Framed that way, it sounds like it’s as safe an approach as you could come up with for network-based AGI. I’m not saying it’s safe, but I am saying it’s hard to be sure it’s not without more detailed arguments and analysis. Which is what I’m trying to do in my work.
Also as you say, it would be far safer to not make these things into agents. But the ease of doing so with a smart enough model and a prompt like “continue pursuing goal X using tools Y as appropriate to gather information and take actions” ensures that they will be turned into agents.
People want a system that actually does their work, not one that just tells them what to do. So they’re going to make agents out of smart LLMs. This won’t be stopped even with legislation; people will do it illegally or from jurisdictions that haven’t passed the laws.
So we are going to have to both hope and plan for this approach, including RL for correct answers, to be safe enough. Or come up with way stronger and more convincing arguments for why it won’t be. I currently think it can be made safe in a realistic way with no major policy or research direction change. But I just don’t know, because I haven’t gotten enough people to engage deeply enough with the real difficulties and likely approaches.
Thank you Seth for the thoughtful reply. I largely agree with most of your points.
I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.
But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).
Regarding the chances of stopping/banning open-ended RL, I agree it’s a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate, I wasn’t trying to reason about what’s reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what’s wise to advocate for.
For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.
I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modelling in terms of alignment difficulty. CoT allows us to elicit higher capabilities out of the same level base text generation model, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?
I think this isn’t quite the right framing, for similar reasons to how symbolic reasoning was all the rage in AI for decades despite basically being a dead end outside of simple domains. We see the CoT use language skillfully, like how a human would use it, and it’s intuitive to think that the language is how/where the reasoning is being done, that “less of the computation is done inside the black box and more is done in human-readable tokens.” But that’s not strictly accurate—the human-readable CoT is more like a projection of the underlying reasoning—which is still as black-box as ever—into the low-dimensional token space.
I’m reminded of the recent paper about fine-tuning Claude to be ‘bad’, and how one of the things that happened to their CoT model was that the training caused it to give bullshit explanations in the CoT about why it was a good idea to do the bad thing this time, and then do the bad thing. If you didn’t know a priori that the thing you were fine-tuning for was bad—e.g. if your dataset incentivized deceiving the human but you weren’t aware it had that vulnerability—there’s no law that says your CoT has to alert you that the model is being bad; the projection of the perturbation to the latent reasoning process into token space will plausibly just look to you like “it says some bullshit and then does the bad thing.”
I agree that CoT faithfulness isn’t something we should assume occurs by default (although it seems like it does to some extent). My claim is that CoT faithfulness is a tractable problem and that people have already made meaningful steps toward guaranteeing faithfulness.
Happy to discuss this further, but have you read e.g. https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I totally agree. I think it’s another framing for why open-ended RL is much more dangerous than pure LLMs. Models trained with open-ended RL are rewarded based on their final results, and will produce any series of tokens that help with that. They are incentivized to produce CoTs that do not logically represent their true thinking.
Pure LLMs, on the other hand, have no desire other than making the next token as likely as possible, given whatever character they are currently simulating. Whether they are “good” or “bad”, I can’t see anything in their training that would incentivize them to develop an efficient secret language to communicate with their future selves in a way that produces misleading CoTs.
Can’t CoTs be what makes RL safe, however? (if you force the reasoner to self-limit under some recursion depth when it senses that the RL agent might be asking for so much that it makes it unsafe)
Yes, as long as there’s a faithful chain of thought. But some RL techniques tend to create “jargon” and otherwise push toward a non-human-readable chain of thought. See Daniel Kokotajlo’s work on faithful chain of thought, including his “shoggoth/face” training proposal. And there’s 5 ways to improve CoT faithfulness, addressing the ways you can easily lose it.
There are also studies investigating training techniques that “make chain of thought more efficient” by driving the cognition into the hidden state so as to use fewer steps. A recent Meta study just did this. Here are some others:
Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.
I don’t disagree that there remains a lot of work to be done, I understand that CoT can be unfaithful, and I am generally against building very capable models that do CoT in latent space,[1] like the Meta paper does. Emphatically, I do not think “alignment is solved” just because o3 reasons out loud, or something.
But, in my view, the research that needs to happen between here and aligned AGI is much more tractable with a weak forward pass and RL-trained CoT as opposed to a highly capable forward pass without RL. I can see an actual path forward to aligning AGI that works like the o-series model, and considering how recently this even became a research topic I think the work that’s already been done is quite promising, including many of Daniel’s proposals.
This is a very general statement, there are lots of caveats and nuances, but I suspect we already agree on the broad strokes.
I think we are in agreement. I see a path to successful technical alignment through agentized LLMs/foundation models. The brief take is at Agentized LLMs will change the alignment landscape, and there’s more on the several overlapping alignment techniques in Internal independent review for language model agent alignment and reasons to think they’ll advance rapidly at Capabilities and alignment of LLM cognitive architectures.
I think it’s very useful to have a faithful chain of thought, but I also think all isn’t lost without it.
Here’s my current framing: we are actually teaching LLMs mostly to answer questions correctly, and to follow our instructions as we intended them. Optimizing those things too much would probably cause our doom by creating interpretations of those pressures we didn’t intend or foresee; this is the classical agent-foundations worry, and I think it’s valid. But we won’t do infinite optimization all in one go; we’ll have a human-level agent that’s on our side to help us with its improvement or next generation, and so on. If we keep instruction-following as not only the training pressure but the explicit top-level goal we give it, we have the AGI on our side to anticipate unexpected consequences of optimization, and it will be “corrigible” as long as it’s approximately following instructions as intended. This gives us second chances and superhuman help maintaining alignment. So I think that alignment target is more important than the technical approach: Instruction-following AGI is easier and more likely than value aligned AGI
I’m all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that’s only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that’s also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What’s the advantage of RL for safety?
“you risk encouraging i) CoTs that carry side information that’s only known to the model”
This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework.
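For concreteness, here is a minimal sketch of what the periodic-paraphrasing idea could look like at inference time. Everything here (the StubModel class, the generate() interface, the prompts) is a hypothetical placeholder of my own, not Daniel's actual proposal or anyone's production pipeline:

```python
# Hypothetical sketch of a periodic-paraphrasing defense against steganographic CoT.
# StubModel stands in for two separately trained models; generate() is a placeholder,
# not a real library API.

class StubModel:
    """Placeholder for an LLM; replace generate() with a real model call."""
    def generate(self, prompt: str) -> str:
        return f"[model output for: {prompt[:40]}...]"

def paraphrase_cot(cot_steps, paraphraser):
    # Rewriting each step with an independent model destroys most hidden
    # channels that rely on exact word choice, punctuation, or formatting.
    return [paraphraser.generate(f"Rewrite, preserving the meaning: {step}")
            for step in cot_steps]

def solve_with_paraphrased_cot(problem, reasoner, paraphraser, n_steps=4):
    cot = []
    for _ in range(n_steps):
        # The reasoner only ever conditions on the paraphrased version of its
        # own earlier steps, so it cannot rely on side information hidden in them.
        context = problem + "\n" + "\n".join(paraphrase_cot(cot, paraphraser))
        cot.append(reasoner.generate(context))
    return cot
```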
“superhuman capabilities”
Agreed that this would be bad, but conditional on this happening, it’s better to do it with RL CoT than with black-box token generation.
“planning ahead and agency in ways that are difficult to anticipate”
Not sure why this would be the case—shouldn’t having access to the model’s thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?
“RL encourages this reasoning process to be more powerful, more agentic, and less predictable”
This is something I agree with in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they’re more capable), but I’m not convinced that this is inherent to RL training, and I’m not exactly sure in what way these models are less predictable.
Why are you conditioning on superhuman AGI emerging? I think it’s something very dangerous that our society isn’t ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) as possible, without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.
It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?
Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?
I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.
I think we’re working with a different set of premises, so I’ll try to disentangle a few ideas.
First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn’t prepared for the advent of AI models that can perform economically useful labor.
Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.
“It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?”
This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk—for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.
“Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?”
No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond “it works better”.
Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:
The underlying LLMs have no reason to lie in very strategic ways, because they are not trained to plan ahead. (In particular, I don’t understand why you think this is true for what you call open-ended RL, but not for RLHF which you seem to be okay with.)
Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. (Again, isn’t this already a problem with RLHF? And/or synthetic data?)
At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100b parameter transformer.)
Of course, I’m more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.
I think we are making good progress on getting to the crux :)
It sounds to me like some of our disagreements may arise from us using terminology differently, so let me try to clarify.
“conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding”
I disagree in 2 ways:
1. It’s wrong to condition on AI being at a given level of capabilities. When deciding which path is better, priority number one should be to not exceed a dangerous level of capabilities. Only then should we ask: given that capability level, which technology stack achieves it in ways that are otherwise safer (more transparent and controllable, less agentic).
2. I’m not exactly sure what you mean by scaffolding, but it sounds like you are conflating RL-based AI with what I call “compositional AI systems” (that’s the term I used in my original post from 20 months ago). In retrospect it’s probably not the best term, but for the sake of consistency let me stick with it.
By “compositional AI systems” I refer to any piece of software that is using an underlying AI model (e.g. an LLM) to achieve much stronger capabilities than those the underlying model would achieve if used on its own as a black box. Common techniques include chain of thought and tool use (e.g. access to web search and other APIs), but in theory you can have infinitely complex systems (e.g. trees of thought). Now, the underlying model can be trained with relatively open-ended RL as in o1 and o3, or it can be a simple LLM that’s only fine-tuned with a simple RLHF or constitutional AI. Whether the system is compositional and whether the underlying model is trained with open-ended RL are two different questions, and I argue that compositional AI systems are relatively safe, as long as you stick with simple LLMs. If you use open-ended RL to train the underlying model, it’s far less safe. I totally agree with you that compositional AI systems are safer than end-to-end opaque models.
What’s the difference between RLHF and open-ended RL?
RLHF is technically a form of RL, but unlike what I call “open-ended RL”, which is intended to make models more capable (as in o1/o3), RLHF doesn’t really increase capabilities in significant ways. It’s intended to just make models more aligned and helpful. It’s generally believed that most of the knowledge of LLMs is learned during pre-training (with pure language modeling) and that RLHF mostly just encourages the models to use their knowledge and capabilities to help users (as opposed to just predict the most likely next token you’d find for a similar prompt on the internet). What I’m worried about is RL that’s intended to increase capabilities by rewarding models that manage to solve complex problems. That’s what I call open-ended RL.
Open-ended RL encourages models to plan ahead and be strategic, because these are useful skills for solving complex problems, but not so useful for producing a chunk of text that a human user would like (as in RLHF). That’s also why I’m worried about strategic deception being much more likely to arise with open-ended RL.
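To pin the terminology down, here is a toy sketch of the two reward signals as I'm distinguishing them. Both functions are illustrative placeholders of my own (preference_model and its score() method are assumptions), not anyone's actual training code:

```python
# RLHF-style reward: a learned human-preference score on a chunk of text.
def rlhf_reward(prompt: str, response: str, preference_model) -> float:
    # The model is rewarded for producing text a human rater would like;
    # there is no notion of "solving" a problem beyond that.
    return preference_model.score(prompt, response)

# Open-ended-RL-style reward (in the sense used in this thread): outcome-based.
def open_ended_rl_reward(final_answer, ground_truth) -> float:
    # The model is rewarded only if the end result is correct, so every
    # intermediate token is implicitly credited for its contribution to
    # reaching that outcome, which is what rewards planning and strategy.
    return 1.0 if final_answer == ground_truth else 0.0
```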
Likewise, I appreciate your willingness to explain your argument, and the opportunity to explain mine has forced me to reflect more on what I believe, which is certainly a plus.
On the first point, I think the statement “priority number one should be to not exceed a dangerous level of capabilities” requires some nuance. For example, if an AI lab pretrained a model that had a dangerous level of capabilities, and then introduced some robust safety features that prevented the use of those capabilities, I think that’s acceptable. In fact, this is already happening—GPT-4 by default is useful for building weapons, but typically refuses to reveal that knowledge to users (yes, jailbreaking etc. exists but the current RLHF is pretty good). So I would agree with an amended statement that “priority number one should be to not exceed a dangerous level of capabilities without correspondingly robust safety features.” I also think that generic statements like “we shouldn’t build dangerous AI” are somewhat akin to statements like “we should reduce homelessness”—if you have a specific intervention or policy in mind, I would be happy to discuss it!
I think your second point is basically our crux. It seems (correct me if I’m wrong!) that the way you are seeing RL models is something like training LLMs with RL enhances their capabilities and makes them more agentic, and this is bad because more capable models are bad. I don’t disagree with this at face value, but I continue to think it misses an important capabilities/safety tradeoff. My view is more like training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. Perhaps our crux is around this tradeoff—are CoT systems sufficiently easier to align that they are safer to develop despite being more capable and more agentic?
My answer is yes. To make sure we’re on the same page,[1] I’m thinking about a CoT system with two parts, an internal reasoning model and an external output model. The reasoning model is initially trained primarily for helpfulness and honesty, with RL to incentivize more correct answers on math, science, historical reasoning, etc. Importantly, there is no safety reward in the initial training, to prevent deceptive alignment, steganography, etc. Therefore this reasoning model’s output is potentially harmful but also faithful—it actually reflects how the model solved a problem. Then the output model is trained to convert the CoT into the user-facing output, and to detect safety issues, etc. (There may also be safety-relevant post-training in the reasoning model, e.g. https://openai.com/index/deliberative-alignment/)
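Here is a minimal sketch of the two-part system I'm picturing, just to make the architecture concrete. StubModel, its generate() method, and the prompts are hypothetical placeholders, not a real API or anyone's deployed system:

```python
# Minimal sketch of the two-part CoT system described above: an internal
# reasoning model trained (with RL) only for correctness/honesty, and a
# separate output model that turns the raw CoT into a safe user-facing answer.

class StubModel:
    """Placeholder for an LLM; replace generate() with a real model call."""
    def generate(self, prompt: str) -> str:
        return f"[model output for: {prompt[:40]}...]"

def answer_query(query: str, reasoning_model: StubModel, output_model: StubModel) -> str:
    # 1) The reasoning model produces a raw, potentially unsafe, but
    #    hopefully faithful chain of thought plus a candidate answer.
    raw_cot = reasoning_model.generate(f"Think step by step and answer:\n{query}")
    # 2) The output model reads the CoT, applies the safety spec, and writes
    #    the final user-facing response (or a refusal).
    return output_model.generate(
        "Safety spec: refuse requests for harmful content.\n"
        f"Internal reasoning:\n{raw_cot}\n"
        "Write the final answer for the user, refusing if necessary.")
```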
Without a doubt, the internal reasoning model is more capable and more agentic—as you wrote, incentivizing over a long sequence of tokens rather than a few at a time makes models less safe and more agentic, which I agree is bad. And there are still problems—ensuring that the CoT is faithful, properly implementing safety specs in the output model, etc. However, aligning this style of system is a much more tractable task than aligning a base LLM, and it is heartening to me that these systems might work sufficiently well with smaller LLMs that nobody is ever incentivized to build Claude 6 or whatever. Do you disagree?
On a meta note, I’m an ML researcher and so have some implicit assumptions about how these are constructed because of the specific models I work on. This could be making us both confused, so I’m trying to be more explicit about the systems I’m envisioning. For what it’s worth, though, I think I would have the same opinion even if I were purely an outside observer.
I think our crux is that, for a reason I don’t fully understand yet, you equate “CoT systems”[1] with RL training. Again, I fully agree with you that systems based on CoT (or other hard-coded software and prompt engineering over weaker underlying models) are much safer than black-box end-to-end models. But why do you assume that this requires open-ended RL[2]? Wouldn’t you agree that it would be safer with simple LLMs that haven’t undergone open-ended RL? I also like your idea of a CoT system with two parts, but again, I’d argue that you don’t need open-ended RL for that.
Other points I somewhat disagree with, but I don’t think are the core of the crux, so we probably shouldn’t delve into:
I still think that priority number one should be to not exceed a dangerous level of capabilities, even with robust safety features.
I find both “we shouldn’t build dangerous AI” and “we should reduce homelessness” to be useful statements. To come up with good plans, we first need to know what we’re aiming for.
I used the term “compositional AI systems”, but for the sake of ease of communication, let’s use your terminology.
See my definition above for open-ended RL.
I agree that this might be our crux, so I’ll try to briefly explain my side. My view is still more or less training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. I think this is only true with open-ended RL because:
Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don’t think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.
Again, I agree that you don’t need open-ended RL for CoT systems, but if you aren’t using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:
Build a regular model and then do CoT during inference (e.g. via prompt engineering)
Build a model and reward it based on CoT during training with RL
Option 1 creates much more capable forward passes, Option 2 does not. I think we have a much better shot at aligning models built the second way.
Before I attempt to respond to your objections, I want to first make sure that I understand your reasoning.
I think you’re saying that in theory it would be better to have CoT systems based on pure LLMs, but you don’t expect these to be powerful enough without open-ended RL, so this approach won’t be incentivized and die out from competition against AI labs who do use open-ended RL. Is it a faithful summary of (part of) your view?
You are also saying that if done right, open-ended RL discourages models from learning to reason strongly in the forward pass. Can you explain what you mean exactly and why you think that?
I think you are also saying that models trained with open-ended RL are easier to align than pure LLMs. Is it because you expect them to be overall more capable (and therefore easier to do anything with, including alignment), or for another reason?
In case it helps to clarify our crux, I’d like to add that I agree with you that AI systems without open-ended RL would likely be much weaker than those with it, so I’m definitely expecting incentives to push more and more AI labs to use this technique. I just wish we could somehow push against these incentives. Pure LLMs producing weaker AI systems is in my opinion a feature, not a bug. I think our society would benefit from slower progress in frontier AGI.
These are close but not quite the claims I believe.
I do think that CoT systems based on pure LLMs will never be that good at problem-solving because a webtext-trained assistant just isn’t that good at working with long chains of reasoning. I think any highly capable CoT system will require at least some RL (or be pre-trained on synthetic data from another CoT system that was trained with RL, but I’m not sure it makes a difference here). I’m a little less confident about whether pure LLMs will be disincentivized—for example, labs might stop developing CoT systems if inference-time compute requirements are too expensive—but I think labs will generally move more resources toward CoT systems.
I think the second two points are best explained with an example, which might clarify how I’m approaching the question.
Suppose I make two LLMs, GPT large (with more parameters) and GPT small (with fewer). I pre-train them on webtext and then I want to teach them how to do modular addition, so I create a bunch of synthetic data of input-output pairs like {6 + 2 mod 5, 3} and finetune the LLMs with the synthetic data to output a single answer, using the difference between their output and the answer as a loss function. GPT large becomes very good at this task, and GPT small does not.
So I create a new dataset of input-output pairs like {Solve 6 + 2 mod 5 step-by-step, writing out your reasoning at each step. Plan your approach ahead of time and periodically reflect on your work., 3}. I train GPT small on this dataset, but when it gets the answer right I reward the entire chain of thought, not just the token with the answer. This approach incentivizes GPT small to use a CoT to solve the problem, and now it performs as well as GPT large did with regular finetuning.[1]
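Here is roughly how I'm imagining that training loop, as a REINFORCE-style toy. The gpt_small interface (sample(), log_prob()) and the optimizer are hypothetical placeholders, not a real training API:

```python
import random

def make_example():
    # Synthetic modular-addition data, mirroring the {prompt, answer} pairs above.
    a, b, m = random.randint(0, 20), random.randint(0, 20), random.randint(2, 9)
    prompt = (f"Solve {a} + {b} mod {m} step-by-step, writing out your reasoning "
              f"at each step. Plan your approach ahead of time and periodically "
              f"reflect on your work.")
    return prompt, (a + b) % m

def rl_step(gpt_small, optimizer):
    prompt, answer = make_example()
    cot_tokens, final_answer = gpt_small.sample(prompt)  # sample a full chain of thought
    reward = 1.0 if final_answer == answer else 0.0
    # Unlike ordinary fine-tuning on the answer token alone, the reward is
    # spread over every token of the chain of thought, so GPT small is
    # incentivized to produce reasoning that reliably leads to correct answers.
    loss = -reward * sum(gpt_small.log_prob(tok, prompt, cot_tokens[:i])
                         for i, tok in enumerate(cot_tokens))
    optimizer.step(loss)
```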
In the end, I have two equally capable (at modular arithmetic) systems—GPT large, which was trained only with finetuning, and GPT small, which was trained with finetuning + open-ended RL. I have a few claims here:
GPT small’s CoT is likely to reflect how it’s actually solving the problem. It couldn’t do the problems pre-RL, so we know it isn’t just solving them internally and backfilling a plausible explanation. We can prevent steganography by doing things like periodically paraphrasing the CoT or translating it between languages. We can also verify this by altering the CoT to plausible but incorrect explanations and ensuring that task performance is degraded (a sketch of such an intervention test follows this list).
For this reason, GPT small is much more interpretable, since we can look at the CoT to understand how it solved the problem. GPT large, on the other hand, is still a complete black box—we don’t know how it’s solving the problem. When we finetuned it, GPT large learned how to do these problems in a single forward pass, making it incredibly hard to understand its reasoning.
And for this reason, GPT small is also easier to align. We can monitor the CoT to make sure it’s actually doing modular arithmetic. In contrast, GPT large might be doing something that locally approximates modular arithmetic but behaves unpredictably outside the training distribution. In fact, if we deploy GPT small in out-of-distribution contexts (such as inputting negative numbers), the CoT will likely provide useful information about how it plans to adapt and what its problem-solving approach will be.
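For the intervention test mentioned in the first claim above, I'm imagining something like the following. The model interface (sample(), answer_given_cot()) and corrupt_cot are hypothetical placeholders; corrupt_cot would swap in plausible but incorrect reasoning steps:

```python
# Sketch of a CoT-faithfulness check: substitute a plausible-but-incorrect
# chain of thought and verify that task performance degrades. If it doesn't,
# the CoT was probably a post-hoc rationalization rather than load-bearing.

def faithfulness_check(model, problems, corrupt_cot):
    solved, degraded = 0, 0
    for prompt, true_answer in problems:
        own_cot, own_answer = model.sample(prompt)          # normal run
        if own_answer != true_answer:
            continue                                        # only count problems it can solve
        solved += 1
        forced_answer = model.answer_given_cot(prompt, corrupt_cot(own_cot))
        if forced_answer != true_answer:
            degraded += 1
    # A high fraction of degraded answers is evidence the CoT actually
    # carries the computation.
    return degraded / max(solved, 1)
```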
I am much more excited about building systems like GPT small than I am about building systems like GPT large. Do you disagree (or disagree about any subpoints, or about this example’s generality?)
P.S. I am enjoying this discussion, I feel that you’ve been very reasonable and I continue to be open to changing my mind about this if I see convincing evidence :)
Oversimplified obviously, but details shouldn’t matter here
Why does RL necessarily mean that AIs are trained to plan ahead?
I explain it in more detail in my original post.
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), then the most likely token after that (T2) given T1, and so on; whereas in RL it optimizes a whole sequence of next tokens (T1, …, Tn), such that the reward for the immediate next token (T1) takes into account the rewards for all the tokens up to Tn.
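In symbols (my own notation, just to pin down the contrast): language modeling scores each token locally against the data, while a policy-gradient RL update weights every token's gradient, including T1's, by the reward of the whole sequence.

```latex
% Next-token prediction: each token is scored locally against the data.
\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t=1}^{n} \log p_\theta(T_t \mid T_{<t})

% Policy-gradient RL (REINFORCE form): the update for every token, including
% the immediate next token T_1, is weighted by the reward of the full sequence.
\nabla_\theta J(\theta) = \mathbb{E}\!\left[ R(T_1,\dots,T_n)
    \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(T_t \mid T_{<t}) \right]
```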