I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modelling in terms of alignment difficulty. CoT allows us to elicit higher capabilities from a base text-generation model of the same level, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?
I’m all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that’s only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate.
Whether you produce the CoT with a pure LLM or a model that’s also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What’s the advantage of RL for safety?
“you risk encouraging i) CoTs that carry side information that’s only known to the model”
This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework.
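To make the paraphrasing idea concrete, here is a minimal sketch of the mechanism (shown at inference time for simplicity; training with periodic paraphrasing would apply the same interleaving inside the RL loop). Everything here is hypothetical: `reasoner` and `paraphraser` are stand-ins for whatever models you would actually plug in, not a real API.

```python
from typing import Callable

# Hypothetical interface: a "model" is just a function from prompt to text.
Generator = Callable[[str], str]

def reason_with_paraphrasing(
    problem: str,
    reasoner: Generator,     # the RL-trained CoT model (capabilities)
    paraphraser: Generator,  # a separate model that only rewrites text
    n_steps: int = 4,
) -> str:
    """Interleave reasoning steps with paraphrasing so that any information
    hidden in the reasoner's exact word choices is destroyed before the
    reasoner conditions on its own chain of thought again."""
    cot = ""
    for _ in range(n_steps):
        # The reasoner extends the chain of thought by one step.
        step = reasoner(f"Problem: {problem}\nReasoning so far:{cot}\nNext step:")
        # The paraphraser rewrites that step, keeping only its surface meaning.
        cot += " " + paraphraser(f"Rewrite this, preserving only its plain meaning: {step}")
    # The final answer is produced from the paraphrased chain, not the raw one.
    return reasoner(f"Problem: {problem}\nReasoning:{cot}\nFinal answer:")
```

The point is that the reasoner never gets to condition on its own raw tokens, so a covert channel has to survive a rewrite by a model that isn’t in on the scheme.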
“superhuman capabilities”
Agreed that this would be bad, but conditional on it happening, it’s better to do it with RL CoT than with black-box token generation.
“planning ahead and agency in ways that are difficult to anticipate”
Not sure why this would be the case—shouldn’t having access to the model’s thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?
“RL encourages this reasoning process to be more powerful, more agentic, and less predictable”
I agree with this in the sense that our frontier models are trained with RL and are also the most powerful and most agentic (since they’re the most capable), but I’m not convinced this is inherent to RL training, and I’m not sure in what way these models are less predictable.
Why are you conditioning on superhuman AGI emerging? I think it’s something very dangerous that our society isn’t ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.
It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?
Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?
I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.
I think we’re working with a different set of premises, so I’ll try to disentangle a few ideas.
First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn’t prepared for the advent of AI models that can perform economically useful labor.
Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that, conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.
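As a toy illustration of what I mean by “stronger oversight mechanisms”: with a scaffolded CoT setup, there is a natural place to put a monitor between the reasoning and the action. The interfaces below are made up for the sake of the sketch, not anyone’s actual deployment code.

```python
from typing import Callable

# Hypothetical interface: a "model" is just a function from prompt to text.
Generator = Callable[[str], str]

def scaffolded_answer(
    task: str,
    reasoner: Generator,             # the capabilities model
    monitor: Callable[[str], bool],  # returns True if a step looks unsafe
    max_steps: int = 8,
) -> str:
    """Because the reasoning is externalized into tokens, an overseer can
    inspect (and veto) every intermediate step before the model acts on it."""
    cot = ""
    for _ in range(max_steps):
        step = reasoner(f"Task: {task}\nReasoning so far:{cot}\nNext step:")
        if monitor(step):
            raise RuntimeError(f"Oversight monitor flagged a reasoning step: {step!r}")
        cot += " " + step
    return reasoner(f"Task: {task}\nReasoning:{cot}\nFinal answer:")
```

With a more capable base model that does the same work in a single forward pass, the only analogous hook is interpretability on the activations, which is exactly the part I consider less tractable.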
“It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?”
This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk: for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.
“Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?”
No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense I guess I do think RL is bad, but only because it increases capabilities faster, which I think is bad. I still disagree that RL is worse for any theoretical reason beyond “it works better”.
Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:
The underlying LLMs have no reason to lie in very strategic ways, because they are not trained to plan ahead. (In particular, I don’t understand why you think this is true for what you call open-ended RL, but not for RLHF which you seem to be okay with.)
Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. (Again, isn’t this already a problem with RLHF? And/or synthetic data?)
At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100B-parameter transformer; a sketch of the kind of check I have in mind is below.)
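For example, one cheap deployment-time check: re-derive the answer from a meaning-preserving paraphrase of the chain of thought and flag cases where the answer changes. The functions here are hypothetical placeholders, and this is meant as an illustration of why I call the problem tractable, not a complete defence.

```python
from typing import Callable

# Hypothetical interface: a "model" is just a function from prompt to text.
Generator = Callable[[str], str]

def answers_disagree(a: str, b: str) -> bool:
    """Crude stand-in for a real answer-equivalence check."""
    return a.strip().lower() != b.strip().lower()

def flag_possible_steganography(
    problem: str,
    cot: str,
    reasoner: Generator,
    paraphraser: Generator,
) -> bool:
    """If the model's answer changes when its chain of thought is replaced by a
    meaning-preserving paraphrase, the original wording was probably carrying
    information beyond its plain meaning."""
    original = reasoner(f"Problem: {problem}\nReasoning: {cot}\nFinal answer:")
    rewritten = paraphraser(f"Rewrite this, preserving only its plain meaning: {cot}")
    rerun = reasoner(f"Problem: {problem}\nReasoning: {rewritten}\nFinal answer:")
    return answers_disagree(original, rerun)
```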
Of course, I’m more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.