I think this isn’t quite the right framing, for similar reasons to how symbolic reasoning was all the rage in AI for decades despite basically being a dead end outside of simple domains. We see the CoT use language skillfully, like how a human would use it, and it’s intuitive to think that the language is how/where the reasoning is being done, that “less of the computation is done inside the black box and more is done in human-readable tokens.” But that’s not strictly accurate—the human-readable CoT is more like a projection of the underlying reasoning—which is still as black-box as ever—into the low-dimensional token space.
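To make the "projection" intuition concrete, here is a toy sketch (arbitrary dimensions, not a claim about any particular model's internals): the latent state at each position is a high-dimensional continuous vector, but all we get to read is the single token selected from it.

```python
import numpy as np

# Toy numbers, chosen for illustration only: the latent state at each position is
# thousands of continuous dimensions, while the readable CoT exposes one discrete
# token per position.
d_model = 4096        # hypothetical residual-stream width
vocab_size = 50_000   # hypothetical tokenizer vocabulary

rng = np.random.default_rng(0)
hidden_state = rng.standard_normal(d_model)                # latent state at one position
unembedding = rng.standard_normal((vocab_size, d_model))   # stand-in unembedding matrix

logits = unembedding @ hidden_state
token_id = int(np.argmax(logits))   # the only thing the human-readable CoT shows us

# Roughly log2(vocab_size) bits are exposed per step, while later positions can still
# attend to the full latent states of earlier positions; what we read is a lossy
# projection of the computation, not the computation itself.
print(f"bits visible per token ~ {np.log2(vocab_size):.1f}")
print(f"latent dimensions per position = {d_model}")
```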
I’m reminded of the recent paper about fine-tuning Claude to be ‘bad’, and how one of the things that happened to their CoT model was that the training caused it to give bullshit explanations in the CoT about why it was a good idea to do the bad thing this time, and then do the bad thing. If you didn’t know a priori that the thing you were fine-tuning for was bad—e.g. if your dataset incentivized deceiving the human but you weren’t aware it had that vulnerability—there’s no law that says your CoT has to alert you that the model is being bad; the projection of the perturbation to the latent reasoning process into token space will plausibly just look to you like “it says some bullshit and then does the bad thing.”
I agree that CoT faithfulness isn’t something we should assume occurs by default (although it seems like it does to some extent). My claim is that CoT faithfulness is a tractable problem and that people have already made meaningful steps toward guaranteeing faithfulness.
Happy to discuss this further, but have you read e.g. https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I totally agree. I think it’s another framing for why open-ended RL is much more dangerous than pure LLMs. Models trained with open-ended RL are rewarded based on their final results, and will produce whatever series of tokens helps achieve them. They are incentivized to produce CoTs that do not faithfully represent their true thinking.
Pure LLMs, on the other hand, have no desire other than making the next token as likely as possible, given whatever character they are currently simulating. Whether they are “good” or “bad”, I can’t see anything in their training that would incentivize them to develop an efficient secret language to communicate with their future selves in a way that produces misleading CoTs.
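For what it’s worth, here is a minimal sketch of the contrast (hypothetical PyTorch-style pseudocode; the `model` interface and `reward_fn` are assumptions, not anyone’s actual training code): the pure-LM objective grades every CoT token against the training text, while an outcome-based RL objective only looks at the final result, so nothing in it ties the intermediate tokens to the actual reasoning.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `model` is assumed to map token ids to per-position logits,
# and `reward_fn` is assumed to score only the final answer.

def pure_lm_loss(model, token_ids):
    """Next-token prediction: every token, including every CoT token, is graded
    directly against the training text."""
    logits = model(token_ids[:, :-1])                       # (batch, seq - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

def outcome_rl_loss(model, prompt_ids, sampled_ids, final_answer, reward_fn):
    """REINFORCE-style outcome reward: the scalar reward depends only on the final
    answer, so any sampled CoT tokens that raise expected reward get reinforced,
    whether or not they reflect the model's actual reasoning."""
    full = torch.cat([prompt_ids, sampled_ids], dim=1)
    logits = model(full)[:, prompt_ids.size(1) - 1 : -1]    # logits that produced sampled_ids
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    reward = reward_fn(final_answer)                        # no term ever reads the CoT
    return -(reward * token_logprobs.sum(dim=-1)).mean()
```

Real pipelines add baselines, KL penalties, and so on, but none of those terms checks whether the CoT reflects the underlying computation either.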
Can’t CoTs be what makes RL safe, though? (If you force the reasoner to self-limit to some recursion depth when it senses that the RL agent might be asking for so much that it becomes unsafe.)