I think we’re working with a different set of premises, so I’ll try to disentangle a few ideas.
First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn’t prepared for the advent of AI models that can perform economically useful labor.
Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.
“It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?”
This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk; for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.
“Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?”
No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond “it works better”.
Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:
The underlying LLMs have no reason to lie in very strategic ways, because they are not trained to plan ahead. (In particular, I don’t understand why you think this is true for what you call open-ended RL, but not for RLHF which you seem to be okay with.)
Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. (Again, isn’t this already a problem with RLHF? And/or synthetic data?)
At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100B-parameter transformer.)
Of course, I’m more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.
I think we are making good progress on getting to the crux :)
It sounds to me like some of our disagreements may arise from us using terminology differently, so let me try to clarify.
“conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding”
I disagree in 2 ways:
1. It’s wrong to condition on AI being at a given level of capabilities. When deciding which path is better, priority number one should be to not exceed a dangerous level of capabilities. Only then should we ask: given that capability level, which technology stack achieves it in ways that are otherwise safer (more transparent and controllable, less agentic).
2. I’m not exactly sure what you mean by scaffolding, but it sounds like you are conflating RL-based AI with what I call “compositional AI systems” (that’s the term I used in my original post from 20 months ago). In retrospect it’s probably not the best term, but for the sake of consistency let me stick with it.
By “compositional AI systems” I refer to any piece of software that uses an underlying AI model (e.g. an LLM) to achieve much stronger capabilities than the underlying model would achieve if used on its own as a black box. Common techniques include chain of thought and tool use (e.g. access to web search and other APIs), but in theory you can have arbitrarily complex systems (e.g. trees of thought). Now, the underlying model can be trained with relatively open-ended RL as in o1 and o3, or it can be a simple LLM that’s fine-tuned only with simple RLHF or constitutional AI. Whether the system is compositional and whether the underlying model is trained with open-ended RL are two different questions, and I argue that compositional AI systems are relatively safe, as long as you stick with simple LLMs. If you use open-ended RL to train the underlying model, it’s far less safe. I totally agree with you that compositional AI systems are safer than end-to-end opaque models.
What’s the difference between RLHF and open-ended RL?
RLHF is technically a form of RL, but unlike what I call “open-ended RL”, which is intended to make models more capable (as in o1/o3), RLHF doesn’t really increase capabilities in significant ways. It’s intended to just make models more aligned and helpful. It’s generally believed that most of the knowledge of LLMs is learned during pre-training (with pure language modeling) and that RLHF mostly just encourages the models to use their knowledge and capabilities to help users (as opposed to just predict the most likely next token you’d find for a similar prompt on the internet). What I’m worried about is RL that’s intended to increase capabilities by rewarding models that manage to solve complex problems. That’s what I call open-ended RL.
Open-ended RL encourages models to plan ahead and be strategic, because these are useful skills for solving complex problems, but not so useful for producing a chunk of text that a human user would like (as in RLHF). That’s also why I’m worried about strategic deception being much more likely to arise with open-ended RL.
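To make the distinction above concrete, here is a toy sketch of the two reward signals. It is only an illustration under stated assumptions, not how any lab implements either method: `rlhf_reward`, `outcome_reward`, `toy_preference_model`, and `toy_verifier` are all hypothetical names invented for this example.

```python
from typing import Callable


def rlhf_reward(prompt: str, response: str,
                preference_model: Callable[[str, str], float]) -> float:
    """RLHF-style signal: how much a learned human-preference model likes the text."""
    return preference_model(prompt, response)


def outcome_reward(final_answer: str, verifier: Callable[[str], bool]) -> float:
    """Open-ended-RL-style signal: 1 if the problem was actually solved, else 0."""
    return 1.0 if verifier(final_answer) else 0.0


# Hypothetical stand-ins, for illustration only.
def toy_preference_model(prompt: str, response: str) -> float:
    # Pretend raters mostly reward a helpful tone.
    return 1.0 if "happy to help" in response.lower() else 0.2


def toy_verifier(answer: str) -> bool:
    # Ground truth for the prompt "6 + 2 mod 5".
    return answer.strip() == "3"


prompt = "What is 6 + 2 mod 5?"
polite_but_wrong = "Happy to help! The answer is 4."

print(rlhf_reward(prompt, polite_but_wrong, toy_preference_model))  # high: the tone is liked
print(outcome_reward("4", toy_verifier))                            # zero: the answer is wrong
```

The point of the contrast is that the first signal is satisfied by text a rater likes, while the second is satisfied only by solving the problem, which is where the pressure toward planning and strategy would come from.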
Likewise, I appreciate your willingness to explain your argument and the opportunity to explain mine has forced me to reflect more on what I believe, which is certainly a plus.
On the first point, I think the statement “priority number one should be to not exceed a dangerous level of capabilities” requires some nuance. For example, if an AI lab pretrained a model that had a dangerous level of capabilities, and then introduced some robust safety features that prevented the use of those capabilities, I think that’s acceptable. In fact, this is already happening: GPT-4 by default is useful for building weapons, but typically refuses to reveal that knowledge to users (yes, jailbreaking etc. exists, but the current RLHF is pretty good). So I would agree with an amended statement that “priority number one should be to not exceed a dangerous level of capabilities without correspondingly robust safety features.” I also think that generic statements like “we shouldn’t build dangerous AI” are somewhat akin to statements like “we should reduce homelessness”: if you have a specific intervention or policy in mind, I would be happy to discuss it!
I think your second point is basically our crux. It seems (correct me if I’m wrong!) that the way you are seeing RL models is something like “training LLMs with RL enhances their capabilities and makes them more agentic, and this is bad because more capable models are bad.” I don’t disagree with this at face value, but I continue to think it misses an important capabilities/safety tradeoff. My view is more like “training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs.” Perhaps our crux is around this tradeoff: are CoT systems sufficiently easier to align that they are safer to develop despite being more capable and more agentic?
My answer is yes. To make sure we’re on the same page,[1] I’m thinking about a CoT system with two parts, an internal reasoning model and an external output model. The reasoning model is initially trained primarily for helpfulness and honesty, with RL to incentivize more correct answers on math, science, historical reasoning, etc. Importantly, there is no safety reward in the initial training, to prevent deceptive alignment, steganography, etc. Therefore this reasoning model’s output is potentially harmful but also faithful—it actually reflects how the model solved a problem. Then the output model is trained to convert the CoT into the user-facing output, and to detect safety issues, etc. (There may also be safety-relevant post-training in the reasoning model, e.g. https://openai.com/index/deliberative-alignment/)
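A minimal sketch of the two-part layout described above, assuming the reasoning model and output model can each be treated as a plain function. All names here (`run_two_part_system`, `Result`, the toy models) are hypothetical, and splitting the safety check into its own `safety_monitor` callable is an assumption of this sketch rather than something the description requires.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    chain_of_thought: str  # internal, unfiltered reasoning (kept for oversight)
    user_output: str       # what the user actually sees
    flagged: bool          # whether the monitor objected to the reasoning


def run_two_part_system(
    question: str,
    reasoning_model: Callable[[str], str],
    output_model: Callable[[str, str], str],
    safety_monitor: Callable[[str], bool],
) -> Result:
    # Part 1: the reasoning model thinks freely; no safety reward shaped this text.
    cot = reasoning_model(question)
    # Part 2: the safety check and user-facing rewrite happen downstream of the CoT.
    flagged = safety_monitor(cot)
    user_output = "I can't help with that." if flagged else output_model(question, cot)
    return Result(cot, user_output, flagged)


# Toy stand-ins, for illustration only.
def toy_reasoner(question: str) -> str:
    return "6 + 2 = 8 ; 8 mod 5 = 3"


def toy_output_model(question: str, cot: str) -> str:
    return f"The answer is {cot.split()[-1]}."


def toy_monitor(cot: str) -> bool:
    return "weapon" in cot.lower()


result = run_two_part_system("What is 6 + 2 mod 5?", toy_reasoner, toy_output_model, toy_monitor)
print(result.user_output)
```

Keeping the raw chain of thought in the result, rather than discarding it after the rewrite, is what lets a human or automated overseer inspect the reasoning separately from the polished answer.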
Without a doubt, the internal reasoning model is more capable and more agentic—as you wrote, incentivizing over a long sequence of tokens rather than a few at a time makes models less safe and more agentic, which I agree is bad. And there are still problems—ensuring that the CoT is faithful, properly implementing safety specs in the output model, etc. However, aligning this style of system is a much more tractable task than aligning a base LLM, and it is heartening to me that these systems might work sufficiently well with smaller LLMs that nobody is ever incentivized to build Claude 6 or whatever. Do you disagree?
On a meta note, I’m an ML researcher and so have some implicit assumptions about how these systems are constructed because of the specific models I work on. This could be making us both confused, so I’m trying to be more explicit about the systems I’m envisioning. For what it’s worth, though, I think I would have the same opinion even if I were purely an outside observer.
I think our crux is that, for a reason I don’t fully understand yet, you equate “CoT systems”[1] with RL training. Again, I fully agree with you that systems based on CoT (or other hard-coded software and prompt engineering over weaker underlying models) are much safer than black-box end-to-end models. But why do you assume that this requires open-ended RL[2]? Wouldn’t you agree that it would be safer with simple LLMs that haven’t undergone open-ended RL? I also like your idea of a CoT system with two parts, but again, I’d argue that you don’t need open-ended RL for that.
Other points I somewhat disagree with, but I don’t think are the core of the crux, so we probably shouldn’t delve into:
I still think that priority number one should be to not exceed a dangerous level of capabilities, even with robust safety features.
I find both “we shouldn’t build dangerous AI” and “we should reduce homelessness” to be useful statements. To come up with good plans, we first need to know what we’re aiming for.
[1] I used the term “compositional AI systems”, but for the sake of ease of communication, let’s use your terminology.
[2] See my definition above for open-ended RL.
I agree that this might be our crux, so I’ll try to briefly explain my side. My view is still more or less “training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs.” I think this is only true with open-ended RL because:
Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don’t think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.
Again, I agree that you don’t need open-ended RL for CoT systems, but if you aren’t using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:
1. Build a regular model and then do CoT during inference (e.g. via prompt engineering)
2. Build a model and reward it based on CoT during training with RL
Option 1 creates much more capable forward passes; Option 2 does not. I think we have a much better shot at aligning models built the second way.
Before I attempt to respond to your objections, I want to first make sure that I understand your reasoning.
I think you’re saying that in theory it would be better to have CoT systems based on pure LLMs, but you don’t expect these to be powerful enough without open-ended RL, so this approach won’t be incentivized and will die out in competition with AI labs that do use open-ended RL. Is that a faithful summary of (part of) your view?
You are also saying that if done right, open-ended RL discourages models from learning to reason strongly in the forward pass. Can you explain what you mean exactly and why you think that?
I think you are also saying that models trained with open-ended RL are easier to align than pure LLMs. Is it because you expect them to be overall more capable (and therefore easier to do anything with, including alignment), or for another reason?
In case it helps to clarify our crux, I’d like to add that I agree with you that AI systems without open-ended RL would likely be much weaker than those with it, so I’m definitely expecting incentives to push more and more AI labs to use this technique. I just wish we could somehow push against these incentives. Pure LLMs producing weaker AI systems is in my opinion a feature, not a bug. I think our society would benefit from slower progress in frontier AGI.
These are close but not quite the claims I believe.
I do think that CoT systems based on pure LLMs will never be that good at problem-solving because a webtext-trained assistant just isn’t that good at working with long chains of reasoning. I think any highly capable CoT system will require at least some RL (or be pre-trained on synthetic data from another CoT system that was trained with RL, but I’m not sure it makes a difference here). I’m a little less confident about whether pure LLMs will be disincentivized—for example, labs might stop developing CoT systems if inference-time compute requirements are too expensive—but I think labs will generally move more resources toward CoT systems.
I think the second two points are best explained with an example, which might clarify how I’m approaching the question.
Suppose I make two LLMs, GPT large (with more parameters) and GPT small (with fewer). I pre-train them on webtext and then I want to teach them how to do modular addition, so I create a bunch of synthetic data of input-output pairs like {6 + 2 mod 5, 3} and finetune the LLMs with the synthetic data to output a single answer, using the difference between their output and the answer as a loss function. GPT large becomes very good at this task, and GPT small does not.
So I create a new dataset of input-output pairs like {Solve 6 + 2 mod 5 step-by-step, writing out your reasoning at each step. Plan your approach ahead of time and periodically reflect on your work., 3}. I train GPT small on this dataset, but when it gets the answer right I reward the entire chain of thought, not just the token with the answer. This approach incentivizes GPT small to use a CoT to solve the problem, and now it performs as well as GPT large did with regular finetuning.[1]
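The two datasets and the whole-chain reward can be sketched roughly as follows. This is a toy illustration only: the function names are hypothetical, the prompt template is shortened, “6 + 2 mod 5” is read as (6 + 2) mod 5 to match the answer 3 in the example, and assigning one shared outcome reward to every token is a simplifying assumption about what “reward the entire chain of thought” means.

```python
import random
from typing import List, Tuple


def make_direct_example(rng: random.Random, modulus: int = 5) -> Tuple[str, str]:
    """Input-output pair for plain finetuning, e.g. ('6 + 2 mod 5', '3')."""
    a, b = rng.randrange(10), rng.randrange(10)
    return f"{a} + {b} mod {modulus}", str((a + b) % modulus)


COT_TEMPLATE = "Solve {problem} step-by-step, writing out your reasoning at each step."


def make_cot_example(rng: random.Random, modulus: int = 5) -> Tuple[str, str]:
    """Same task, but the prompt asks for written reasoning before the answer."""
    problem, answer = make_direct_example(rng, modulus)
    return COT_TEMPLATE.format(problem=problem), answer


def per_token_rewards(cot_tokens: List[str], final_answer: str, target: str) -> List[float]:
    """Outcome-based credit: every token in the chain shares one reward,
    determined solely by whether the final answer matches the target."""
    reward = 1.0 if final_answer.strip() == target else 0.0
    return [reward] * len(cot_tokens)


rng = random.Random(0)
print(make_direct_example(rng))
print(make_cot_example(rng))

chain = "6 + 2 = 8 ; 8 mod 5 = 3".split()
print(per_token_rewards(chain, final_answer="3", target="3"))
```

Note that nothing in `per_token_rewards` inspects the intermediate tokens. That is the property both sides of this discussion are arguing about: it lets the chain carry the computation, but it also does not by itself force the chain to be a faithful explanation.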
In the end, I have two equally capable (at modular arithmetic) systems—GPT large, which was trained only with finetuning, and GPT small, which was trained with finetuning + open-ended RL. I have a few claims here:
GPT small’s CoT is likely to reflect how it’s actually solving the problem. It couldn’t do the problems pre-RL, so we know it isn’t just solving them internally and backfilling a plausible explanation. We can prevent steganography by doing things like periodically paraphrasing the CoT or translating it between languages. We can also verify this by altering the CoT to plausible but incorrect explanations and ensuring that task performance is degraded.
For this reason, GPT small is much more interpretable, since we can look at the CoT to understand how it solved the problem. GPT large, on the other hand, is still a complete black box—we don’t know how it’s solving the problem. When we finetuned it, GPT large learned how to do these problems in a single forward pass, making it incredibly hard to understand its reasoning.
And for this reason, GPT small is also easier to align. We can monitor the CoT to make sure it’s actually doing modular arithmetic. In contrast, GPT large might be doing something that locally approximates modular arithmetic but behaves unpredictably outside the training distribution. In fact, if we deploy GPT small in out-of-distribution contexts (such as inputting negative numbers), the CoT will likely provide useful information about how it plans to adapt and what its problem-solving approach will be.
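The verification idea in the first claim (alter the CoT to something plausible but incorrect and check that task performance degrades) can be sketched as a small test harness. Everything below is a hypothetical illustration: the “models” are hand-written functions standing in for systems that either read their chain of thought or ignore it, and `corrupt_last_step` is just one simple way to produce a plausible-but-wrong chain.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str, str]  # (question, chain_of_thought, target_answer)
Model = Callable[[str, str], str]  # maps (question, chain_of_thought) to an answer


def accuracy(model: Model, examples: List[Example],
             edit: Callable[[str], str] = lambda cot: cot) -> float:
    """Fraction of examples answered correctly when the CoT is passed through `edit`."""
    correct = sum(model(q, edit(cot)) == target for q, cot, target in examples)
    return correct / len(examples)


def corruption_drop(model: Model, examples: List[Example],
                    corrupt: Callable[[str], str]) -> float:
    """Accuracy lost when the CoT is corrupted. A large drop suggests the answer
    really depends on the CoT; a drop near zero suggests the CoT is decorative."""
    return accuracy(model, examples) - accuracy(model, examples, corrupt)


def corrupt_last_step(cot: str) -> str:
    """Replace the final residue with a different, still plausible-looking one."""
    *steps, last = cot.split()
    return " ".join(steps + [str((int(last) + 1) % 5)])


# Toy stand-ins: one "model" reads its answer off the CoT, the other ignores it.
def cot_reader(question: str, cot: str) -> str:
    return cot.split()[-1]


def cot_ignorer(question: str, cot: str) -> str:
    a, _, b, _, m = question.split()
    return str((int(a) + int(b)) % int(m))


EXAMPLES: List[Example] = [
    ("6 + 2 mod 5", "6 + 2 = 8 ; 8 mod 5 = 3", "3"),
    ("7 + 9 mod 5", "7 + 9 = 16 ; 16 mod 5 = 1", "1"),
    ("1 + 1 mod 5", "1 + 1 = 2 ; 2 mod 5 = 2", "2"),
]

print(corruption_drop(cot_reader, EXAMPLES, corrupt_last_step))   # large drop
print(corruption_drop(cot_ignorer, EXAMPLES, corrupt_last_step))  # no drop
```

A check like this only gives evidence about whether the answer depends on the visible chain. It does not rule out steganography on its own, which is why the paraphrasing and translation ideas are listed alongside it.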
I am much more excited about building systems like GPT small than I am about building systems like GPT large. Do you disagree (or disagree about any subpoints, or about this example’s generality?)
P.S. I am enjoying this discussion. I feel that you’ve been very reasonable, and I continue to be open to changing my mind about this if I see convincing evidence :)
[1] Oversimplified, obviously, but the details shouldn’t matter here.