This is really cool! CoT is very powerful for improving capabilities, and it might even improve interpretability by externalizing reasoning or reduce specification gaming by allowing us to supervise process instead of outcomes. Because it seems like a method that might be around for a while, it’s important to surface problems early on. Showing that the conclusions of CoT don’t always match the explanations surfaces an important problem for both capabilities-style research using CoT to solve math problems and safety agendas that hope to monitor or provide feedback on the CoT.
Just wanted to drop a link to this new paper that tries to improve CoT performance by identifying and fixing inconsistencies in the reasoning process. This wouldn’t necessarily solve the problem in your paper, unless the critic managed to identify that the generator was biased by your suggestion in the original prompt. Still, interesting stuff!
One question for you: Faithful CoT would improve interpretability by allowing us to read off a model’s thought process. But it’s also likely to improve general capabilities: so far, CoT has been one of the most effective ways to improve mathematical and logical reasoning in LLMs. Do you think improving the general reasoning skills of LLMs is dangerous? And if so, then imagine your work inspires more people to work on faithfulness in CoT. Do you expect any acceleration of general capabilities to be outweighed by the benefits to interpretability? For example, if someone wants to misuse an LLM to cause harm, faithful CoT would improve their ability to do so, and the greater ease of interpretability would be little consolation. On balance, do you think it’s worth working on, and what subproblems are particularly valuable from a safety angle? (Trying to make this question tough because I think the issue is tricky and multifaceted, not because I have any gotcha answer.)
Thanks! Glad you like it. A few thoughts:
CoT is already incredibly hot, so I don’t think we’re adding to the hype. If anything, I’d be more concerned if anyone came away thinking that CoT was a dead end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it).
Improving the faithfulness of CoT explanations doesn’t mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT. Any work on improving faithfulness would be really helpful for safety and would have minimal capability externalities.