Before I attempt to respond to your objections, I want to first make sure that I understand your reasoning.
I think you’re saying that in theory it would be better to have CoT systems based on pure LLMs, but you don’t expect these to be powerful enough without open-ended RL, so this approach won’t be incentivized and will die out through competition with AI labs that do use open-ended RL. Is that a faithful summary of (part of) your view?
You are also saying that, if done right, open-ended RL discourages models from learning to reason strongly in the forward pass. Can you explain exactly what you mean by that, and why you think so?
I think you are also saying that models trained with open-ended RL are easier to align than pure LLMs. Is it because you expect them to be overall more capable (and therefore easier to do anything with, including alignment), or for another reason?
In case it helps to clarify our crux, I’d like to add that I agree with you that AI systems without open-ended RL would likely be much weaker than those with it, so I’m definitely expecting incentives to push more and more AI labs to use this technique. I just wish we could somehow push back against these incentives. Pure LLMs producing weaker AI systems is, in my opinion, a feature, not a bug. I think our society would benefit from slower progress in frontier AGI.
I totally agree. I think it’s another framing of why open-ended RL is much more dangerous than pure LLMs. Models trained with open-ended RL are rewarded based on their final results, and will produce whatever sequence of tokens helps with that. They are incentivized to produce CoTs that do not faithfully represent their actual reasoning.
Pure LLMs, on the other hand, have no desire other than to predict the most plausible next token, given whatever character they are currently simulating. Whether that character is “good” or “bad”, I can’t see anything in a pure LLM’s training that would incentivize it to develop an efficient secret language for communicating with its future self in a way that produces misleading CoTs.
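To make the contrast concrete, here is roughly how I think of the two training signals (my own notation, abstracting over how any particular lab actually implements them). A pure LLM is trained to maximize the log-likelihood of the next token in its training data,

$$\max_\theta \; \sum_t \log p_\theta(x_t \mid x_{<t}),$$

so a token is only ever rewarded for matching the data distribution, never for its downstream effect on a final answer. Outcome-based RL instead maximizes something like

$$\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[\,R(\mathrm{outcome}(\tau))\,\big],$$

where the reward depends only on the final result of the sampled trajectory $\tau$, so any intermediate CoT tokens that raise the expected reward get reinforced, whether or not they faithfully describe the model’s reasoning.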