It seems like many problems we’ll train models to solve with RL won’t be solvable in a single forward pass. E.g., consider a math proof that takes 20 lines to write out, and perhaps also requires some intermediate reasoning to figure out the next line. Do you expect vestigial reasoning to appear for such problems as well?
I’m not sure I understand why I should expect long CoTs to persist in the process-supervised but not in the outcome-supervised case. I agree that writing about deleting the tests is salient in the latter but not in the former case, but writing a vague phrase followed by deleting the tests is salient in the former case and leads to the same outcome. In the process-supervised case, the causal chain is attempt to solve the problem → write a vague phrase → delete the tests, and in the outcome-supervised case, it’s attempt to solve the problem → write about deleting the tests → delete the tests. Why do you expect that it’s easier for the model to stumble upon the strategy of skipping the first step in the latter chain?
Yeah, I didn’t mention this explicitly, but I think this is also likely to happen! It could look something like “the model can do steps 1-5, 6-10, 11-15, and 16-20 in one forward pass each, but it still writes out 20 steps.” Presumably most of the tasks we use reasoning models for will be too complex to do in a single forward pass.
Good point! My thinking is that the model may have a bias for the CoT to start with some kind of obvious “planning” behavior rather than just a vague phrase. Either planning to delete the tests or (futilely) planning to fix the actual problem meets this need. Alternatively, it’s possible that the two training runs resulted in two different kinds of CoT by random chance.
Great post! Some questions:
It seems like many problems we’ll train models to solve with RL won’t be solvable in a single forward pass. E.g., consider a math proof that takes 20 lines to write out, and perhaps also requires some intermediate reasoning to figure out the next line. Do you expect vestigial reasoning to appear for such problems as well?
I’m not sure I understand why I should expect long CoTs to persist in the process-supervised but not in the outcome-supervised case. I agree that writing about deleting the tests is salient in the latter but not in the former case, but writing a vague phrase followed by deleting the tests is salient in the former case and leads to the same outcome. In the process-supervised case, the causal chain is attempt to solve the problem → write a vague phrase → delete the tests, and in the outcome-supervised case, it’s attempt to solve the problem → write about deleting the tests → delete the tests. Why do you expect that it’s easier for the model to stumble upon the strategy of skipping the first step in the latter chain?
Yeah, I didn’t mention this explicitly, but I think this is also likely to happen! It could look something like “the model can do steps 1-5, 6-10, 11-15, and 16-20 in one forward pass each, but it still writes out 20 steps.” Presumably most of the tasks we use reasoning models for will be too complex to do in a single forward pass.
Good point! My thinking is that the model may have a bias for the CoT to start with some kind of obvious “planning” behavior rather than just a vague phrase. Either planning to delete the tests or (futilely) planning to fix the actual problem meets this need. Alternatively, it’s possible that the two training runs resulted in two different kinds of CoT by random chance.