There seem to be papers showing that if you naively train on chain of thought, you train models not to verbalize potentially problematic reasoning in their chain of thought. However, I don't see much discussion of how to train chain-of-thought models to verbalize their reasoning better.
If you can easily train a model to hide its reasoning, you should also be able to train models the other way around, to be more explicit about their reasoning.
One approach I imagine is to take a query (say, diagnosing a medical issue), replace key words in a way that changes the output, and then see how well the chain of thought reflects that change. If the chain of thought tells you something about the change in outcome, you reinforce it; if it doesn't reflect the changed outcome, you punish it.
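To make that concrete, here is a minimal sketch of how such a reward might be computed for a single perturbed pair. The `generate` callable and the keyword-matching check are stand-ins I'm assuming for illustration; a real setup would presumably use a judge model to decide whether the chain of thought actually explains the change.

```python
from typing import Callable, Optional, Tuple


def counterfactual_faithfulness_reward(
    generate: Callable[[str], Tuple[str, str]],  # hypothetical: query -> (chain_of_thought, answer)
    query: str,
    perturbed_query: str,
    changed_keyword: str,  # the key word that was swapped in the perturbation
) -> Optional[float]:
    """Score a chain of thought by whether it verbalizes the factor
    that actually changed the model's answer."""
    cot_orig, answer_orig = generate(query)
    cot_pert, answer_pert = generate(perturbed_query)

    if answer_orig == answer_pert:
        # The perturbation didn't change the outcome, so this pair
        # gives no signal about faithfulness; skip it.
        return None

    # Crude check: does the perturbed chain of thought mention the
    # swapped keyword at all? A judge model would be a better test
    # of whether the CoT actually explains the changed outcome.
    if changed_keyword.lower() in cot_pert.lower():
        return 1.0   # reinforce: the CoT surfaces the changed factor
    return -1.0      # punish: the answer changed but the CoT hides why
```

The returned score could then be fed into whatever RL objective is already being used to train the chain of thought, with the `None` pairs simply dropped from the batch.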